A key challenge in non-cooperative multi-agent systems is that of developing efficient
planning algorithms for intelligent agents to interact and perform effectively among
boundedly rational, self-interested agents (e.g., humans). The practicality of existing
works addressing this challenge is being undermined due to either the restrictive
assumptions of the other agents' behavior, the failure in accounting for their rationality,
or the prohibitively expensive cost of modeling and predicting their intentions. To boost
the practicality of research in this field, we investigate how intention prediction can be
efficiently exploited and made practical in planning, thereby leading to efficient
intention-aware planning frameworks capable of predicting the intentions of other agents
and acting optimally with respect to their predicted intentions. We show that the
performance losses incurred by the resulting planning policies are linearly bounded by the
error of intention prediction. Empirical evaluations through a series of stochastic games
demonstrate that our policies can achieve better and more robust performance than the
state-of-the-art algorithms.

1 Introduction

A fundamental challenge in non-cooperative multi-agent systems (MAS) is that of designing
intelligent agents that can efficiently plan their actions under uncertainty to interact and
perform effectively among boundedly rational,^1 self-interested agents (e.g., humans). Such a
challenge is posed by many real-world applications [7], which include automated electronic
trading markets where software agents interact, and traffic intersections where autonomous
cars have to negotiate with
human-driven vehicles to cross them, among others. These applications can be modeled as partially 
observable stochastic games (POSGs) in which the agents are self-interested (i.e., non-cooperative) 
and do not necessarily share the same goal, thus invalidating the use of planning algorithms devel- 
oped for coordinating cooperative agents (i.e., solving POSGs with common payoffs) I ITI [TSl [161 . 
Existing planning frameworks for non-cooperative MAS can be generally classified into: 

Game-theoretic frameworks. Based on the well-founded classical game theory, these multi-agent 
planning frameworks iTl l9]r]characterize the agents' interactions in a POSG using solution concepts 
such as Nash equilibrium. Such frameworks suffer from the following drawbacks: (a) Multiple 
equilibria may exist, (b) only the optimal actions corresponding to the equilibria are specified, and 
(c) they assume that the agents do not collaborate to beneficially deviate from the equilibrium (i.e., 
no coalition), which is often violated by human agents. 



'Boundedly rational agents are subject to limited cognition and time in making decisions fSl 
^2 The learning framework of Hu and Wellman [1998] trivially reduces to planning when the transition model is known a priori.

Decision-theoretic frameworks. Unafflicted by the drawbacks of game-theoretic approaches, they
extend single-agent decision-theoretic planning frameworks such as the Markov decision process 
(MDP) and partially observable Markov decision process (POMDP) to further characterize inter- 
actions with the other self-interested agents in a POSG. In particular, the interactive POMDP (I- 
POMDP) framework 13] IH [6l [141 is proposed to explicitly account for the bounded rationality of 
self-interested agents: It replaces POMDP's flat beliefs over the physical states with interactive be- 
liefs over both the physical states and the other agent's beliefs. Empowered by such an enriched, 
highly expressive belief space, I-POMDP can explicitly model and predict the other agent's intention 
(i.e., mixed strategy) under partial observability. 

However, solving I-POMDP is prohibitively expensive due to the following computational difficul- 
ties mi^): (a) Curse of dimensionality - since I-POMDP's interactive belief is over the joint space 
of physical states and the other agent's beliefs (termed interactive state space in [4]), its dimension
can be extremely large and possibly infinite; (b) curse of history - similar to POMDP, I-POMDP's 
policy space grows exponentially with the length of planning horizon; and (c) curse of nested rea- 
soning - as I-POMDP utilizes a nested structure within our agent's belief space to represent its 
belief over the other agent's belief and the other agent's belief over our agent's belief and so on, it 
aggravates the effects of the other two curses [4].

To date, a number of approximate I-POMDP techniques 13] ID [pi have been proposed to mitigate 
some of the above difficulties. Notably, Interactive Particle Filtering (I-PF) [3] focused on
alleviating the curse of dimensionality by generalizing the particle filtering technique to
accommodate the multi-agent setting while Interactive Point-based Value Iteration [4] (I-PBVI)
aimed at relieving the curse of history by generalizing the well-known point-based value
iteration (PBVI) [13] to operate in the interactive belief space. Unfortunately, I-PF fails to
address the curse of history and it is not clear how PBVI or other sampling-based algorithms
can be modified to work with a particle representation of interactive beliefs, whereas I-PBVI
suffers from the curse of dimensionality because its dimension of interactive belief grows
exponentially with the length of planning horizon of the other agent (Section 3). Using
interactive beliefs, it is therefore not known whether it is even possible to jointly lift
both curses, for example, by extending I-PF or I-PBVI. Furthermore, they do not explicitly
account for the curse of nested reasoning. As a result, their use has been restricted to small,
simple problems lfT2l (e.g., multiagent Tiger lISlfTTlM'). 

To tractably solve larger problems, existing approximate I-POMDP techniques such as I-PF and
I-PBVI have to significantly reduce the quality of approximation and impose restrictive
assumptions (Section 3), or risk not producing a policy at all with the available memory of
modern-day computers. This naturally raises the concern of whether the resulting policy can
still perform well under different partially observable environments, as investigated in
Section 5. Since such drastic compromises in solution quality are necessary for approximate
I-POMDP techniques to tackle a larger problem directly, it may be worthwhile to instead
consider formulating an approximate version of the problem with a less sophisticated
structural representation such that it allows an exact or near-optimal solution policy to be
more efficiently derived. More importantly, can the induced policy perform robustly against
errors in modeling and predicting the other agent's intention? If we are able to formulate
such an approximate problem, the resulting policy can potentially perform better than an
approximate I-POMDP policy in the original problem while incurring significantly less
planning time.

Our work in this paper investigates such an alternative: We first develop a novel
intention-aware nested MDP framework (Section 2) for planning in fully observable multi-agent
environments. Inspired by the cognitive hierarchy model of games [1], nested MDP constitutes
a recursive reasoning formalism to predict the other agent's intention and then exploit it to
plan our agent's optimal interaction policy. Its formalism is by no means a reduction of
I-POMDP. We show that nested MDP incurs linear time in the planning horizon length and
reasoning depth. Then, we propose an I-POMDP Lite framework (Section 3) for planning in
partially observable multi-agent environments that, in particular, exploits a practical
structural assumption: The intention of the other agent is driven by nested MDP, which is
demonstrated theoretically to be an effective surrogate of its true intention when the agents
have fine sensing and actuation capabilities. This assumption allows the other agent's
intention to be predicted efficiently and, consequently, I-POMDP Lite to be solved efficiently
in polynomial time, hence lifting the three curses of I-POMDP. As demonstrated empirically,
it also improves I-POMDP Lite's robustness in planning performance by overestimating the true
sensing capability of the other agent. We provide theoretical performance guarantees of the
nested MDP and I-POMDP Lite policies that improve with decreasing error of intention
prediction (Section 4). We extensively evaluate our frameworks through experiments involving
a series of POSGs that have to be modeled using a significantly larger state space (Section 5).

2 Nested MDP 

Given that the environment is fully observable, our proposed nested MDP framework can be used
to predict the other agent's strategy and such predictive information is then exploited to
plan our agent's optimal interaction policy. Inspired by the cognitive hierarchy model of
games [1], it constitutes a well-defined recursive reasoning process that comprises $k$ levels
of reasoning. At level 0 of reasoning, our agent simply believes that the other agent chooses
actions randomly and computes its best response by solving a conventional MDP that implicitly
represents the other agent's actions as stochastic noise in its transition model. At higher
reasoning levels $k \ge 1$, our agent plans its optimal strategy by assuming that the other
agent's strategy is based only on lower levels $0, 1, \ldots, k-1$ of reasoning. In this
section, we will formalize nested MDP and show that our agent's optimal policy at level $k$
can be computed recursively.

Nested MDP Formulation. Formally, nested MDP for agent $t$ at level $k$ of reasoning is
defined as a tuple $M_t^k \triangleq (S, U, V, T, R, \{\pi_{-t}^i\}_{i=0}^{k-1}, \phi)$ where
$S$ is a set of all possible states of the environment; $U$ and $V$ are, respectively, sets
of all possible actions available to agents $t$ and $-t$; $T : S \times U \times V \times S
\to [0, 1]$ denotes the probability $\Pr(s'|s,u,v)$ of going from state $s \in S$ to state
$s' \in S$ using agent $t$'s action $u \in U$ and agent $-t$'s action $v \in V$;
$R : S \times U \times V \to \mathbb{R}$ is a reward function of agent $t$;
$\pi_{-t}^i : S \times V \to [0, 1]$ is a reasoning model of agent $-t$ at level $i < k$, as
defined later in (3); and $\phi \in (0, 1)$ is a discount factor.

Nested MDP Planning. The optimal $(h+1)$-step-to-go value function of nested MDP $M_t^k$ at
level $k \ge 0$ for agent $t$ satisfies the following Bellman equation:

$$U_t^{k,h+1}(s) \triangleq \max_{u \in U} \sum_{v \in V} \hat{\pi}_{-t}^k(s,v)\, Q_t^{k,h+1}(s,u,v)$$
$$Q_t^{k,h+1}(s,u,v) \triangleq R(s,u,v) + \phi \sum_{s' \in S} T(s,u,v,s')\, U_t^{k,h}(s') \qquad (1)$$

where the mixed strategy $\hat{\pi}_{-t}^k$ of the other agent $-t$ for $k \ge 0$ is predicted as

$$\hat{\pi}_{-t}^k(s,v) \triangleq \begin{cases} \sum_{i=0}^{k-1} p(i)\, \pi_{-t}^i(s,v) & \text{if } k > 0, \\ |V|^{-1} & \text{otherwise,} \end{cases} \qquad (2)$$

where the probability $p(i)$ (i.e., $\sum_{i=0}^{k-1} p(i) = 1$) specifies how likely agent
$-t$ will reason at level $i$; a uniform distribution is assumed when there is no such prior
knowledge. Alternatively, one possible direction for future work is to learn $p(i)$ using
multi-agent reinforcement learning techniques such as those described in [2, 8]. At level 0,
agent $-t$'s reasoning model $\pi_{-t}^0$ is induced by solving $M_{-t}^0$. To obtain agent
$-t$'s reasoning models $\{\pi_{-t}^i\}_{i=1}^{k-1}$ at levels $i = 1, \ldots, k-1$, let
$Opt_{-t}^i(s)$ be the set of agent $-t$'s optimal actions for state $s$ induced by solving
its nested MDP $M_{-t}^i$, which recursively involves building agent $t$'s reasoning models
$\{\pi_t^l\}_{l=0}^{i-1}$ at levels $l = 0, 1, \ldots, i-1$, by definition. Then,

$$\pi_{-t}^i(s,v) \triangleq \begin{cases} |Opt_{-t}^i(s)|^{-1} & \text{if } v \in Opt_{-t}^i(s), \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

After predicting agent $-t$'s mixed strategy $\hat{\pi}_{-t}^k$ (2), agent $t$'s optimal
policy (i.e., reasoning model) $\pi_t^k$ at level $k$ can be induced by solving its
corresponding nested MDP $M_t^k$ (1).



Time Complexity. Solving $M_t^k$ involves solving $\{M_{-t}^i\}_{i=0}^{k-1}$, which, in turn,
requires solving $\{M_t^i\}_{i=0}^{k-2}$, and so on. Thus, solving $M_t^k$ requires solving
$M_t^i$ ($i = 0, \ldots, k-2$) and $M_{-t}^i$ ($i = 0, \ldots, k-1$), that is, $O(k)$ nested
MDPs. Given $\hat{\pi}_{-t}^k$, the cost of deriving agent $t$'s optimal policy grows linearly
with the horizon length $h$ as the backup operation (1) has to be performed $h$ times. In
turn, each backup operation incurs $O(|S|^2)$ time given that $|U|$ and $|V|$ are constants.
Then, given agent $-t$'s profile of reasoning models $\{\pi_{-t}^i\}_{i=0}^{k-1}$, predicting
its mixed strategy $\hat{\pi}_{-t}^k(s,v)$ (2) incurs $O(k)$ time. Therefore, solving agent
$t$'s nested MDP $M_t^k$ (1) or inducing its corresponding reasoning model $\pi_t^k$ incurs
$O(kh|S|^2)$.
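
To make the recursion in (1)-(3) concrete, the following is a minimal Python sketch, not the implementation used in the paper. It assumes dense NumPy arrays, a uniform prior $p(i)$, and a separate reward array `R_other` for the other agent indexed with its own action first; the function names and array layouts are hypothetical. For simplicity it builds the reasoning models of both agents at every level from 0 to $k$, which is slightly more than strictly required but still $O(k)$ MDP solves.

```python
import numpy as np

def best_response_model(T, R, pi_other, phi, h):
    """Run h backups of (1) and return a uniform distribution over optimal actions, as in (3).

    T[s, a, b, s2]: transition probabilities (own action a, other agent's action b).
    R[s, a, b]: own reward. pi_other[s, b]: predicted mixed strategy of the other agent.
    """
    if h < 1:
        raise ValueError("h must be at least 1")
    value = np.zeros(T.shape[0])
    for _ in range(h):
        q = R + phi * (T @ value)                      # q[s, a, b]
        q_own = np.einsum("sb,sab->sa", pi_other, q)   # expectation over the other's action
        value = q_own.max(axis=1)
    optimal = np.isclose(q_own, q_own.max(axis=1, keepdims=True))
    return optimal / optimal.sum(axis=1, keepdims=True)

def nested_mdp_models(T, R_t, R_other, phi, h, k):
    """Build reasoning models at levels 0..k for both agents, bottom-up.

    R_other[s, v, u] is the other agent's reward with its own action first (an assumption).
    Returns (models_t, models_other); models_t[k][s, u] is agent t's level-k policy.
    """
    S, U, V, _ = T.shape
    T_other = T.transpose(0, 2, 1, 3)                  # other agent's view: T[s, v, u, s2]
    models_t, models_other = [], []
    for level in range(k + 1):
        if level == 0:
            # (2) with k = 0: the opponent is assumed to act uniformly at random.
            pred_other = np.full((S, V), 1.0 / V)
            pred_t = np.full((S, U), 1.0 / U)
        else:
            # (2) with uniform p(i): average the lower-level reasoning models.
            pred_other = np.mean(models_other, axis=0)
            pred_t = np.mean(models_t, axis=0)
        new_t = best_response_model(T, R_t, pred_other, phi, h)
        new_other = best_response_model(T_other, R_other, pred_t, phi, h)
        models_t.append(new_t)
        models_other.append(new_other)
    return models_t, models_other
```

In a one-state game where the other agent has a dominant action and our agent is rewarded for matching it, the level-0 model of our agent is indifferent, while its level-1 model picks the matching action.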

3 Intention-Aware POMDP 

To tackle partial observability, it seems obvious to first consider generalizing the recursive reasoning 
formalism of nested MDP. This approach yields two practical complications: (a) our agent's belief 
over both the physical states and the other agent's beliefs (i.e., a probability distribution over prob- 
ability distributions) has to be modeled, and (b) the other agent's mixed strategy has to be predicted 
for each of its infinitely many possible beliefs. Existing approximate I-POMDP techniques address 
these respective difficulties by (a) using a finite particle representation like I-PF [3] or
(b) constraining the interactive state space $IS$ to $IS' = S \times Reach(B, h)$ like I-PBVI
[4] where $Reach(B, h)$ includes the other agent's beliefs reachable from a finite set $B$ of
its candidate initial beliefs over
horizon length h. 

However, recall from Section 1 that since I-PF suffers from the curse of history, the particle
approximation of interactive beliefs has to be made significantly coarse to solve larger
problems tractably, thus degrading its planning performance. I-PBVI, on the other hand, is
plagued by the curse of dimensionality due to the need of constructing the set $Reach(B, h)$
whose size grows exponentially with $h$. As a result, it cannot tractably plan beyond a few
look-ahead steps for even the small test problems in Section 5. Furthermore, it imposes a
restrictive assumption that the true initial belief of the other agent, which is often not
known in practice, needs to be included in $B$ to satisfy the absolute continuity condition
of interactive beliefs [4] (see Appendix C for more details). So, I-PBVI may not perform well
under practical environmental settings where a long planning horizon is desirable or the other
agent's initial belief is not included in $B$. For I-PF and I-PBVI, the curse of nested
reasoning aggravates the effects of other curses.

Since predicting the other agent's intention using approximate I-POMDP techniques is prohibitively 
expensive, it is practical to consider a computationally cheaper yet credible information source pro- 
viding its intention such as its nested MDP policy. Intuitively, such a policy describes the intention 
of the other agent with full observability who believes that our agent has full observability as well. 
Knowing the other agent's nested MDP policy is especially useful when the agents' sensing and 
actuation capabilities are expected to be good (i.e., accurate observation and transition models), as 
demonstrated in the following simple result: 

Theorem 1. Let $Q_{-t}^n(s,v) \triangleq |U|^{-1} \sum_{u \in U} Q_{-t}^{0,n}(s,v,u)$ and
$Q_{-t}^n(b,v)$ denote $n$-step-to-go values of selecting action $v \in V$ in state $s \in S$
and belief $b$, respectively, for the other agent $-t$ using nested MDP and I-POMDP at
reasoning level 0 (i.e., MDP and POMDP). If $b(s) > 1 - \epsilon$ and



y{s,v,u)3{s',o)Pr{s'\s,v,u)>l-^ A Pr{o\s,v)>l - ^ 



for some e > 0, then 



^^^j i?max i?imn I ^^^ 



where $R_{\max}$ and $R_{\min}$ denote agent $-t$'s maximum and minimum immediate payoffs, respectively.



Its proof is given in Appendix A. Following from Theorem 1, we conjecture that, as $\epsilon$ decreases
(i.e., observation and transition models become more accurate), the nested MDP policy $\pi_{-t}^0$ of the
other agent is more likely to approximate the exact I-POMDP policy closely. Hence, the nested 
MDP policy serves as an effective surrogate of the exact I-POMDP policy (i.e., true intention) of the 
other agent if the agents have fine sensing and actuation capabilities; such a condition often holds 
for typical real-world environments. 

Motivated by the above conjecture and Theorem 1, we propose an alternative I-POMDP Lite
framework by exploiting the following structural assumption: The intention of the other agent
is driven by nested MDP. This assumption allows the other agent's intention to be predicted
efficiently by computing its nested MDP policy, thus lifting I-POMDP's curse of nested
reasoning (Section 2). More importantly, it enables both the curses of dimensionality and
history to be lifted, which makes solving I-POMDP Lite very efficient, as explained below.
Compared to existing game-theoretic frameworks [9, 10] which make strong assumptions of the
other agent's behavior, our assumption is clearly less restrictive. Unlike the approximate
I-POMDP techniques, it does not cause I-POMDP Lite to be subject to coarse approximation when
solving larger problems, which can potentially result in better planning performance.
Furthermore, by modeling and predicting the other agent's intention using nested MDP, I-POMDP
Lite tends to overestimate its true sensing capability and can therefore achieve a more robust
performance than I-PBVI using significantly less planning time under different partially
observable environments (Section 5).

I-POMDP Lite Formulation. Our I-POMDP Lite framework constitutes an integration of the nested
MDP for predicting the other agent's mixed strategy into a POMDP for tracking our agent's
belief in partially observable environments. Naively, this can be achieved by extending the
belief space to $\Delta(S \times V)$ (i.e., each belief $b$ is now a probability distribution
over the state-action space $S \times V$) and solving the resulting augmented POMDP. The size
of representing each belief therefore becomes $O(|S||V|)$ (instead of $O(|S|)$), which
consequently increases the cost of processing each belief (i.e., belief update). Fortunately,
our I-POMDP Lite framework can alleviate this extra cost: By factorizing
$b(s,v) = b(s)\, \hat{\pi}_{-t}^k(s,v)$, the belief space over $S \times V$ can be reduced to
one over $S$ because the predictive probabilities $\hat{\pi}_{-t}^k(s,v)$ (2) (i.e., predicted
mixed strategy of the other agent) are derived separately in advance by solving nested MDPs.
This consequently alleviates the curse of dimensionality pertaining to the use of interactive
beliefs, as discussed in Section 1. Furthermore, such a reduction of the belief space
decreases the time and space complexities and typically allows an optimal policy to be derived
faster in practice: the space required to store $n$ sampled beliefs is only
$O(n|S| + |S||V|)$ instead of $O(n|S||V|)$.

Formally, I-POMDP Lite (for our agent $t$) is defined as a tuple
$(S, U, V, O, T, Z, R, \hat{\pi}_{-t}^k, \phi, b_0)$ where $S$ is a set of all possible states
of the environment; $U$ and $V$ are sets of all actions available to our agent $t$ and the
other agent $-t$, respectively; $O$ is a set of all possible observations of our agent $t$;
$T : S \times U \times V \times S \to [0, 1]$ is a transition function that depends on the
agents' joint actions; $Z : S \times U \times O \to [0, 1]$ denotes the probability
$\Pr(o|s',u)$ of making observation $o \in O$ in state $s' \in S$ using our agent $t$'s action
$u \in U$; $R : S \times U \times V \to \mathbb{R}$ is the reward function of agent $t$;
$\hat{\pi}_{-t}^k : S \times V \to [0, 1]$ denotes the predictive probability $\Pr(v|s)$ of
selecting action $v$ in state $s$ for the other agent $-t$ and is derived using (2) by solving
its nested MDPs at levels $0, \ldots, k-1$; $\phi \in (0, 1)$ is a discount factor; and
$b_0 \in \Delta(S)$ is a prior belief over the states of environment.

I-POMDP Lite Planning. Similar to solving POMDP (except for a few modifications), the optimal
value function of I-POMDP Lite for our agent $t$ satisfies the following Bellman equation:

$$V_{n+1}(b) = \max_{u} \left( R(b,u) + \phi \sum_{v,o} \Pr(v,o|b,u)\, V_n(b') \right) \qquad (5)$$

where our agent $t$'s expected immediate payoff is

$$R(b,u) = \sum_{s,v} R(s,u,v) \Pr(v|s)\, b(s) \qquad (6)$$

and the belief update is given as

$$b'(s') = \beta\, Z(s',u,o) \sum_{s,v} T(s,u,v,s') \Pr(v|s)\, b(s) .$$
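
The expected payoff (6) and the belief update can be illustrated with a short Python sketch. It is only an illustration under assumed array layouts (`T[s, u, v, s2]`, `Z[s2, u, o]`, `R[s, u, v]`, `pi_hat[s, v]`), with hypothetical function names; the normalizing constant is computed explicitly so that the updated belief sums to one.

```python
import numpy as np

def expected_reward(b, u, R, pi_hat):
    """R(b, u) = sum_{s,v} R(s, u, v) Pr(v|s) b(s), as in (6)."""
    return float(np.einsum("s,sv,sv->", b, pi_hat, R[:, u]))

def belief_update(b, u, o, T, Z, pi_hat):
    """b'(s') proportional to Z(s', u, o) sum_{s,v} T(s, u, v, s') Pr(v|s) b(s)."""
    weight = pi_hat * b[:, None]                         # weight[s, v] = Pr(v|s) b(s)
    predicted = np.einsum("sv,svn->n", weight, T[:, u])  # distribution over s' before observing
    unnormalized = Z[:, u, o] * predicted
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return unnormalized / total
```

Because the other agent's action is summed out using the precomputed `pi_hat`, the belief stays a vector of length $|S|$ rather than a table over $S \times V$.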



Note that (6) yields an intuitive interpretation: The uncertainty over the state of the
environment can be factored out of the prediction of the other agent $-t$'s strategy by
assuming that agent $-t$ can fully observe the environment. Consequently, solving I-POMDP
Lite (5) involves choosing the policy that maximizes the expected total reward with respect
to the prediction of agent $-t$'s mixed strategy using nested MDP. Like POMDP, the optimal
value function $V_n(b)$ of I-POMDP Lite can be approximated arbitrarily closely (for infinite
horizon) by a piecewise-linear and convex function that takes the form of a set $V_n$^3 of
$\alpha$ vectors:

$$V_n(b) = \max_{\alpha \in V_n} (\alpha \cdot b) . \qquad (7)$$

^3 With slight abuse of notation, the value function is also used to denote the set of corresponding $\alpha$ vectors.

Solving I-POMDP Lite therefore involves computing the corresponding set of $\alpha$ vectors
that can be achieved inductively: given a finite set $V_n$ of $\alpha$ vectors, we can plug
(7) into (5) to derive $V_{n+1}$ (see Theorem 3 in Section 4). Similar to POMDP, the number of
$\alpha$ vectors grows exponentially with the time horizon:
$|V_{n+1}| = |U||V_n|^{|V||O|}$. To avoid this exponential blow-up, I-POMDP Lite inherits
essential properties from POMDP (Section 4) that make it amenable to be solved by existing
sampling-based algorithms such as PBVI [13] used here. The idea is to sample a finite set $B$
of reachable beliefs (from $b_0$) to approximately represent the belief simplex, thus avoiding
the need to generate the full belief reachability tree to compute the optimal policy. This
alleviates the curse of history pertaining to the use of interactive beliefs (Section 1).
Then, it suffices to maintain a single $\alpha$ vector for each belief point $b \in B$ that
maximizes $V_n(b)$. Consequently, each backup step can be performed in polynomial time:
$O(|U||V||O||B|^2|S|)$, as sketched below:

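BACKUP($V_n$, $B$)

1. $\Gamma^{u,*} \leftarrow \alpha^{u,*}(s) = \sum_{v} R(s,u,v) \Pr(v|s)$

2. $\Gamma^{u,v,o} \leftarrow \forall \alpha'_i \in V_n : \alpha_i^{u,v,o}(s) = \phi \Pr(v|s) \sum_{s'} Z(s',u,o)\, T(s,u,v,s')\, \alpha'_i(s')$

3. $\Gamma_b^u \leftarrow \Gamma^{u,*} + \sum_{v,o} \arg\max_{\alpha \in \Gamma^{u,v,o}} (\alpha \cdot b)$

4. Return $V_{n+1} \leftarrow \forall b \in B : \arg\max_{\Gamma_b^u, \forall u \in U} (\Gamma_b^u \cdot b)$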
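
The four BACKUP steps can be written out as a small Python sketch. This is an illustrative rendering under assumed array layouts (`T[s, u, v, s2]`, `Z[s2, u, o]`, `R[s, u, v]`, `pi_hat[s, v]`), not the code used in the experiments; the function name `backup` and the use of explicit loops for step 3 are choices made here for readability.

```python
import numpy as np

def backup(alphas, beliefs, T, Z, R, pi_hat, phi):
    """One point-based backup: maps V_n (alphas, shape (m, S)) to V_{n+1} (shape (|B|, S))."""
    S, U, V, _ = T.shape
    O = Z.shape[2]
    # Step 1: gamma_star[u, s] = sum_v R(s, u, v) Pr(v|s)
    gamma_star = np.einsum("suv,sv->us", R, pi_hat)
    # Step 2: gamma[u, v, o, i, s] = phi Pr(v|s) sum_{s'} Z(s', u, o) T(s, u, v, s') alpha_i(s')
    gamma = phi * np.einsum("sv,nuo,suvn,in->uvois", pi_hat, Z, T, alphas)
    new_alphas = []
    for b in beliefs:
        # Step 3: for each u, add the best projected vector for every (v, o) pair.
        gamma_b = gamma_star.copy()
        for u in range(U):
            for v in range(V):
                for o in range(O):
                    candidates = gamma[u, v, o]          # shape (m, S)
                    gamma_b[u] += candidates[np.argmax(candidates @ b)]
        # Step 4: keep the vector of the action that is best at this belief point.
        new_alphas.append(gamma_b[np.argmax(gamma_b @ b)])
    return np.array(new_alphas)
```

Starting from a single zero vector and applying `backup` repeatedly over the same belief set keeps at most one $\alpha$ vector per belief point, so the representation never grows beyond $|B|$ vectors.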

Time Complexity. Given the set $B$ of sampled beliefs, the cost of solving I-POMDP Lite is
divided into two parts: (a) The cost of predicting the mixed strategy of the other agent using
nested MDP (2) is $O(kh|S|^2)$ (Section 2); (b) To determine the cost of approximately solving
I-POMDP Lite with respect to this predicted mixed strategy, since each backup step incurs
$O(|U||V||O||B|^2|S|)$ time, solving I-POMDP Lite for $h$ steps incurs
$O(h|U||V||O||B|^2|S|)$ time. By considering $|U|$, $|V|$, and $|O|$ as constants, the cost of
solving I-POMDP Lite can be simplified to $O(h|S||B|^2)$. Thus, the time complexity of solving
I-POMDP Lite is $O(h|S|(k|S| + |B|^2))$, which is much less computationally demanding than the
exponential cost of I-PF and I-PBVI (Section 1).

4 Theoretical Analysis 

In this section, we prove that I-POMDP Lite inherits convergence, piecewise-linear, and convex 
properties of POMDP that make it amenable to be solved by existing sampling-based algorithms. 
More importantly, we show that the performance loss incurred by I-POMDP Lite is linearly bounded 
by the error of prediction of the other agent's strategy. This result also holds for that of nested MDP 
policy because I-POMDP Lite reduces to nested MDP under full observability. 

Theorem 2 (Convergence). Let $V_\infty$ be the value function of I-POMDP Lite for infinite time horizon.
Then, it is contracting/converging: $\|V_\infty - V_{n+1}\|_\infty \le \phi \|V_\infty - V_n\|_\infty$.

Theorem 3 (Piecewise Linearity and Convexity). The optimal value function $V_n$ can be represented
as a finite set of $\alpha$ vectors: $V_n(b) = \max_{\alpha \in V_n} (\alpha \cdot b)$.

We can prove by induction that the number of $\alpha$ vectors grows exponentially with the
length of planning horizon; this explains why deriving the exact I-POMDP Lite policy is
intractable in practice.

Definition 1. Let $\pi_{-t}^*$ be the true strategy of the other agent $-t$ such that
$\pi_{-t}^*(s,v)$ denotes the true probability $\Pr^*(v|s)$ of selecting action $v \in V$ in
state $s \in S$ for agent $-t$. Then, the prediction error is
$\epsilon_p = \max_{v,s} |\Pr^*(v|s) - \Pr(v|s)|$.

Definition 2. Let $R_{\max} = \max_{s,u,v} R(s,u,v)$ be the maximum value of our agent $t$'s payoffs.

Theorem 4 (Policy Loss). The performance loss $\delta_n$ incurred by executing I-POMDP Lite
policy, induced w.r.t. the predicted strategy $\hat{\pi}_{-t}^k$ of the other agent $-t$ using
nested MDP (as compared to its true strategy $\pi_{-t}^*$), after $n$ backup steps is linearly
bounded by the prediction error $\epsilon_p$:



Sn < 2ep\V\Rn 



1 






The above result implies that, by increasing the accuracy of the prediction of the other agent's 
strategy, the performance of the I-POMDP Lite policy can be proportionally improved. This gives a 
very strong motivation to seek better and more reliable techniques, other than our proposed nested 
MDP framework, for intention prediction. The formal proofs of the above theorems are provided in 
Appendix D. 

Figure 1: Intersection Navigation: (a) the road intersection modeled as a 7 x 7 grid (the black areas
are not passable); and (b) accidents caused by cars crossing trajectories. 

5 Experiments and Discussion 



This section first evaluates the empirical performance of nested MDP in a practical
multi-agent task called Intersection Navigation for Autonomous Vehicles (INAV) (Section 5.1),
which involves a traffic scenario with multiple cars coming from different directions (North,
East, South, West) into an intersection and safely crossing it with minimum delay. Our goal
is to implement an intelligent autonomous vehicle (AV) that cooperates well with human-driven
vehicles (HV) to quickly and safely clear an intersection, in the absence of communication.
Then, the performance of I-POMDP Lite is evaluated empirically in a series of partially
observable stochastic games (POSGs) (Section 5.2). All experiments are run on a Linux server
with two 2.2GHz Quad-Core Xeon E5520 processors and 24GB RAM.



5.1 Nested MDP Evaluations 

In this task, the road intersection is modeled as a 7 x 7 grid, as shown in Fig. 1a. The
autonomous car (AV) starts at the bottom row of the grid and travels North while a
human-driven car (HV) starts at the leftmost column and travels to the East. Each car has
five actions: slow down (0), forward right (1), forward left (2), forward (3) and fast
forward (4). Furthermore, it is assumed that 'slow down' has speed level 0, 'forward left',
'forward', and 'forward right' have speed level 1 while 'fast forward' has speed level 2. The
difference in speed levels of two consecutive actions should be at most 1. In general, the
car is penalized by the delay cost $D > 0$ for each executed action. But, if the joint actions
of both cars lead to an accident by crossing trajectories or entering the same cell (Fig. 1b),
they are penalized by the accident cost $C > 0$. The goal is to help the autonomous car to
safely clear the intersection as fast as possible. So, a smaller value of $D/C$ is desired as
it implies a more rational behavior in our agent.
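
The speed-level rule above can be stated as a simple action filter. The sketch below is one possible reading, assuming the constraint is enforced by restricting which actions may follow the previously executed one; the names `SPEED_LEVEL` and `allowed_next_actions` are hypothetical, and the paper does not say how the rule is encoded in its state space.

```python
# Action ids and speed levels as listed in the text:
# slow down (0) -> 0; forward right (1), forward left (2), forward (3) -> 1; fast forward (4) -> 2.
SPEED_LEVEL = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2}

def allowed_next_actions(previous_action):
    """Return the actions whose speed level differs from the previous one by at most 1."""
    previous_level = SPEED_LEVEL[previous_action]
    return [a for a, level in SPEED_LEVEL.items() if abs(level - previous_level) <= 1]
```

Under this reading, a car that has just slowed down cannot immediately choose 'fast forward', and a car moving at 'fast forward' cannot immediately choose 'slow down'.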

The above scenario is modeled using nested MDP, which requires more than 18000 states. Each
state comprises the cells occupied by the cars and their current speed levels. The discount
factor $\phi$ is set to 0.99. The delay and accident costs are hard-coded as $D = 1$ and
$C = 100$. Nested MDP is used to predict the mixed strategy of the human driver and our car's
optimal policy is computed with respect to this predicted mixed strategy. For evaluation, our
car is run through 800 intersection episodes. The human-driven car is scripted with the
following rational behavior: the human-driven car probabilistically estimates how likely a
particular action will lead to an accident in the next time step, assuming that our car
selects actions uniformly. It then forms a distribution over all actions such that most of
the probability mass concentrates on actions that least likely lead to an accident. Its next
action is selected by sampling from this distribution.

Figure 2: Performance comparison between nested MDPs at reasoning levels 0, 1, and 2.



We compare the performance of nested MDPs at reasoning levels $k = 0, 1, 2$. When $k = 0$, it
is equivalent to the traditional MDP policy that treats the other car as environmental noise.
During execution, we maintain a running average $T_t$ (over the first $t$ episodes) of the
number of actions taken to clear an intersection and the number $I_t$ of intersections
experiencing accidents. The average ratio of the empirical delay is defined as
$R_t^d = (T_t - T_{\min})/T_{\min} = T_t/T_{\min} - 1$ with $T_{\min} = 3$ (i.e., minimum
delay required to clear the intersection). The empirical accident rate is defined as
$R_t^c = I_t/t$. The average incurred cost is therefore $M_t = C R_t^c + D R_t^d$. A smaller
$M_t$ implies better policy performance.
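
As a worked illustration of the cost metric, the following Python sketch computes the average incurred cost from per-episode logs. The function name and the log format (one step count and one accident flag per episode) are assumptions for illustration; the defaults follow the values stated in this section ($C = 100$, $D = 1$, $T_{\min} = 3$).

```python
def average_incurred_cost(steps, accidents, C=100.0, D=1.0, t_min=3):
    """Average incurred cost over the logged episodes.

    steps[i]: number of actions taken to clear the intersection in episode i.
    accidents[i]: True if episode i ended in an accident.
    """
    t = len(steps)
    if t == 0 or t != len(accidents):
        raise ValueError("steps and accidents must be non-empty and of equal length")
    mean_steps = sum(steps) / t                   # running average of actions taken
    accident_count = sum(1 for a in accidents if a)
    delay_ratio = mean_steps / t_min - 1          # empirical delay ratio
    accident_rate = accident_count / t            # empirical accident rate
    return C * accident_rate + D * delay_ratio
```

For example, four episodes taking 3, 6, 3, and 6 actions with a single accident give a delay ratio of 0.5 and an accident rate of 0.25, hence a cost of 25.5; with $C = 100$ the accident term dominates.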

Fig. 2 shows the results of the performance of the evaluated policies. It can be observed that the
Mt curves of nested MDPs at reasoning levels 1 and 2 lie below that of MDP policy (i.e., reasoning 
level 0). So, nested MDP outperforms MDP. This is expected since our rationality assumption holds: 
nested MDP's prediction is closer to the human driver's true intention and is thus more informative 
than the uniformly-distributed human driver's strategy assumed by MDP. Thus, we conclude that 
nested MDP is effective when the other agent's behavior conforms to our definition of rationality. 

5.2 I-POMDP Lite Evaluations 

Specifically, we compare the performance of I-POMDP Lite vs. I-POMDP (at reasoning level
$k = 1$) players under adversarial environments modeled as zero-sum POSGs. These players are
tasked to compete against nested MDP and I-POMDP opponents at reasoning level $k = 0$ (i.e.,
respectively, MDP and POMDP opponents) whose strategies exactly fit the structural assumptions
of I-POMDP Lite (Section 3) and I-POMDP (at $k = 1$), respectively. The I-POMDP player is
implemented using I-PBVI which is reported to be the best approximate I-POMDP technique [4].
Each competition consists of 40 stages; the reward/penalty is discounted by 0.95 after each
stage. The performance of each player, against its opponent, is measured by averaging its
total rewards over 1000 competitions. Our test environment is larger than the benchmark
problems in [4]: There are 10 states, 3 actions, and 8 observations for each player. In
particular, we let each of the first 6 states be associated with a unique observation with
high probability. For the remaining 4 states, every disjoint pair of states is associated
with a unique observation with high probability. Hence, the sensing capabilities of I-POMDP
Lite and I-POMDP players are significantly weaker than that of the MDP opponent with full
observability.
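
One way to build an observation model with the structure described above is sketched below in Python. The value 0.9 for "high probability", the uniform spread of the remaining mass, the independence from the action, and the pairing of states (6, 7) and (8, 9) are all assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def build_observation_model(n_actions=3, p_high=0.9):
    """Return Z[s, u, o] for 10 states and 8 observations.

    States 0..5 each favor their own observation; states (6, 7) share
    observation 6 and states (8, 9) share observation 7 (assumed pairing).
    """
    n_states, n_obs = 10, 8
    preferred = [0, 1, 2, 3, 4, 5, 6, 6, 7, 7]
    # Spread the remaining probability mass uniformly over the other observations.
    Z = np.full((n_states, n_actions, n_obs), (1.0 - p_high) / (n_obs - 1))
    for s, o in enumerate(preferred):
        Z[s, :, o] = p_high
    return Z
```

With such a model, a player can usually tell the first six states apart but cannot distinguish the two states within a pair from a single observation, which is what makes its sensing weaker than that of a fully observing opponent.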

Table 1 shows the results of I-POMDP and I-POMDP Lite players' performance with varying
horizon lengths. The observations are as follows: (a) Against a POMDP opponent whose strategy
completely favors I-POMDP, both players win by a fair margin and I-POMDP Lite outperforms
I-POMDP; (b) against an MDP opponent, I-POMDP suffers a huge loss (i.e., -37.88) as its
structural assumption of a POMDP opponent is violated, while I-POMDP Lite wins significantly
(i.e., 24.67); and (c) the planning times of I-POMDP Lite and I-POMDP appear to, respectively,
grow linearly and exponentially in the horizon length.

Table 1: I-POMDP's and I-POMDP Lite's performance against POMDP and MDP opponents with
varying horizon lengths h (|S| = 10, |A| = 3, |O| = 8). '*' denotes that the program ran out of
memory after 10 hours.

                          POMDP         MDP            Time (s)    |IS'|
I-POMDP (h = 2)           13.33±1.75    -37.88±1.74    177.35      66110
I-POMDP (h = 3)           *             *              *           1587010
I-POMDP Lite (h = 1)      15.22±1.81    15.18±1.41     0.02        N.A.
I-POMDP Lite (h = 3)      17.40±1.71    24.23±1.54     0.45        N.A.
I-POMDP Lite (h = 8)      17.42±1.70    24.66±1.54     17.11       N.A.
I-POMDP Lite (h = 10)     17.43±1.70    24.67±1.55     24.38       N.A.



I-POMDP's exponential blow-up in planning time is expected because its bounded interactive state
space IS' increases exponentially in the horizon length (i.e., curse of dimensionality), as shown in
Table 1. Such a scalability issue is especially critical to large-scale problems. To demonstrate this,
Fig. 3b shows the planning time of I-POMDP Lite growing linearly in the horizon length for a large
zero-sum POSG with 100 states, 3 actions, and 20 observations for each player; it takes about 6 and
1/2 hours to plan for 100-step look-ahead. In contrast, I-POMDP fails to even compute its 2-step
look-ahead policy within 12 hours.

Figure 3: Graphs of (a) performance R of I-POMDP Lite and I-POMDP players against hybrid
opponents (|S| = 10, |A| = 3, |O| = 8); (b) I-POMDP Lite's planning time vs. horizon length h in
a large POSG (|S| = 100, |A| = 3, |O| = 20); and (c) I-POMDP Lite's planning time vs. reasoning
level k for h = 1 and 10 (|S| = 10, |A| = 3, |O| = 10).

Table 2: I-POMDP's and I-POMDP Lite's performance against POMDP and MDP opponents with
varying horizon lengths h (|S| = 10, |A| = 3, |O| = 10). '*' denotes that the program ran out of
memory after 10 hours.

                          POMDP         MDP            Time (s)   |IS'|
I-POMDP (h = 2)           5.70±1.67     -9.62±1.50     815.28     102410
I-POMDP (h = 3)           *             *              *          3073110
I-POMDP Lite (h = 1)      11.18±1.75    20.25±1.53     0.03       N.A.
I-POMDP Lite (h = 3)      14.89±1.79    27.49±1.53     0.95       N.A.
I-POMDP Lite (h = 8)      14.99±1.79    26.91±1.55     24.10      N.A.
I-POMDP Lite (h = 10)     15.01±1.79    26.91±1.55     33.74      N.A.

It may seem surprising that I-POMDP Lite outperforms I-POMDP even when tested against a
POMDP opponent whose strategy completely favors I-POMDP. This can be explained by the following
reasons: (a) I-POMDP's exponential blow-up in planning time forbids it from planning beyond
3 look-ahead steps, thus degrading its planning performance; (b) as shown in Section 3, the cost
of solving I-POMDP Lite is only polynomial in the horizon length and reasoning depth, thus
allowing our player to plan with a much longer look-ahead (Fig. 3b) and achieve substantially better
planning performance; and (c) with reasonably accurate observation and transition models,
Theorem 1 indicates that the strategy of the MDP opponent (i.e., nested MDP at reasoning level 0) is
likely to approximate that of the true POMDP opponent closely, thus reducing the degree of
violation of I-POMDP Lite's structural assumption of a nested MDP opponent. Such an assumption
also seems to make our I-POMDP Lite player overestimate the sensing capability of an unforeseen
POMDP opponent and consequently achieve a robust performance against it. On the other hand,
the poor performance of I-POMDP against an MDP opponent is expected because I-POMDP's
structural assumption of a POMDP opponent is likely to cause its player to underestimate an unforeseen
opponent with superior sensing capability (e.g., MDP) and therefore perform badly against it. In
contrast, I-POMDP Lite performs significantly better due to its structural assumption of a nested
MDP opponent at level 0, which matches the true MDP opponent exactly.

Interestingly, it can be empirically shown that when both players' observations are made more
informative than those used in the previous experiment, the performance advantage of I-POMDP Lite
over I-POMDP, when tested against a POMDP opponent, increases. To demonstrate this, we modify
the previous zero-sum POSG to involve 10 observations (instead of 8) such that every state (instead
of a disjoint pair of states) is associated with a unique observation with high probability (i.e., > 0.8);
the rest of the probability mass is then uniformly distributed among the other observations. Hence,
the sensing capabilities of I-POMDP Lite and I-POMDP players in this experiment are much better
than those in the previous experiment and closer to that of the MDP opponent with
full observability. Table 2 summarizes the results of I-POMDP Lite's and I-POMDP's performance
when tested against the POMDP and MDP opponents in the environment described above.



To further understand how the I-POMDP Lite and I-POMDP players perform when the sensing
capability of an unforeseen opponent varies, we set up another adversarial scenario in which both
players are pitted against a hybrid opponent: At each stage, with probability p, the opponent knows the
exact state of the game (i.e., its belief is set to be peaked at this known state) and then follows
the MDP policy; otherwise, it follows the POMDP policy. So, a higher value of p implies better
sensing capability of the opponent. The environment settings are the same as those used in the
first experiment, that is, 10 states, 8 observations and 3 actions for each player (Table 1). Fig. 3a
shows how the performance, denoted R, of I-POMDP Lite and I-POMDP players varies
with p: I-POMDP's performance decreases rapidly as p increases (i.e., opponent's strategy violates
I-POMDP's structural assumption more), thus increasing the performance advantage of I-POMDP
Lite over I-POMDP. This demonstrates I-POMDP Lite's robust performance when tested against
unforeseen opponents whose sensing capabilities violate its structural assumption.
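The hybrid opponent's per-stage behavior can be sketched as below. This is a minimal illustration of the switching rule described in the text, not the authors' implementation; the two toy policies and the function name `hybrid_opponent_action` are hypothetical.

```python
import random

def hybrid_opponent_action(true_state, belief, mdp_policy, pomdp_policy, p, rng):
    """Pick the hybrid opponent's action for one stage.

    With probability p the opponent learns the true state: its belief is
    replaced by a point mass on that state and it follows the MDP policy.
    Otherwise it keeps its current belief and follows the POMDP policy.
    Returns (action, belief_used).
    """
    if rng.random() < p:
        peaked = [1.0 if s == true_state else 0.0 for s in range(len(belief))]
        return mdp_policy(true_state), peaked
    return pomdp_policy(belief), list(belief)

# Hypothetical toy policies over 10 states and 3 actions, for illustration only.
def toy_mdp_policy(state):
    return state % 3

def toy_pomdp_policy(belief):
    most_likely = max(range(len(belief)), key=lambda s: belief[s])
    return (most_likely + 2) % 3

rng = random.Random(0)
uniform = [0.1] * 10
action, used = hybrid_opponent_action(4, uniform, toy_mdp_policy, toy_pomdp_policy,
                                      p=1.0, rng=rng)
print(action, used[4])
```

Setting p = 0 recovers a pure POMDP opponent and p = 1 a pure MDP opponent, which are the two endpoints compared in Tables 1 and 2.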

To summarize the above observations, (a) in different partially observable environments where the
agents have reasonably accurate observation and transition models, I-POMDP Lite significantly
outperforms I-POMDP (Tables 1 and 2 and Fig. 3a); and (b) interestingly, it can be observed from
Fig. 3a that when the sensing capability of the unforeseen opponent improves, the performance
advantage of I-POMDP Lite over I-POMDP increases. These results consistently demonstrate
I-POMDP Lite's robust performance against unforeseen opponents with varying sensing capabilities.
In contrast, I-POMDP only performs well against opponents whose strategies completely favor it,
but its performance is not as good as that of I-POMDP Lite due to its limited horizon length caused
by the extensive computational cost of modeling the opponent. Unlike I-POMDP's exponential
blow-up in horizon length h and reasoning depth k (Section 1), I-POMDP Lite's processing cost
grows linearly in both h (Fig. 3b) and k (Fig. 3c). When h = 10, it can be observed from Fig. 3c
that I-POMDP Lite's overall processing cost does not change significantly with increasing k because
the cost O(kh|S|^2) of predicting the other agent's strategy with respect to k is dominated by the cost
O(h|S||B|^2) of solving I-POMDP Lite for large h (Section 3).

6 Conclusion 

This paper proposes the novel nested MDP and I-POMDP Lite frameworks which incorporate
the cognitive hierarchy model of games [1] for intention prediction into the normative
decision-theoretic POMDP paradigm to address some practical limitations of existing planning frameworks
for self-interested MAS such as computational impracticality [4] and restrictive equilibrium theory
of agents' behavior [9]. We have theoretically guaranteed that the performance losses incurred by
our I-POMDP Lite policies are linearly bounded by the error of intention prediction. We have
empirically demonstrated that I-POMDP Lite performs significantly better than the state-of-the-art
planning algorithms in partially observable stochastic games. Unlike I-POMDP, I-POMDP Lite's
performance is very robust against unforeseen opponents whose sensing capabilities violate the structural
assumption (i.e., of a nested MDP opponent) that it has exploited to achieve significant
computational gain. In terms of computational efficiency and robustness in planning performance, I-POMDP
Lite is thus more practical for use in larger-scale problems.

References 

[1] C. F. Camerer, T. H. Ho, and J. K. Chong. A cognitive hierarchy model of games. Quarterly
J. Economics, 119(3):861-898, 2004.

[2] G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: A
Bayesian approach. In Proc. AAMAS, pages 709-716, 2003.

[3] P. Doshi and P. Gmytrasiewicz. Monte Carlo sampling methods for approximating interactive
POMDPs. JAIR, pages 297-337, 2009.

[4] P. Doshi and D. Perez. Generalized point based value iteration for interactive POMDPs. In
Proc. AAAI, pages 63-68, 2008.

[5] G. Gigerenzer and R. Selten. Bounded Rationality. MIT Press, 2002.

[6] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings.
JAIR, 24:49-79, 2005.

[7] E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially
observable stochastic games. In Proc. AAAI, pages 709-715, 2004.

[8] T. N. Hoang and K. H. Low. A general framework for interacting Bayes-optimally with
self-interested agents using arbitrary parametric model and model prior. In Proc. IJCAI, 2013.

[9] J. Hu and M. P. Wellman. Multi-agent reinforcement learning: Theoretical framework and an
algorithm. In Proc. ICML, pages 242-250, 1998.

[10] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc.
ICML, pages 157-163, 1994.

[11] R. Nair and M. Tambe. Taming decentralized POMDPs: Towards efficient policy computation
for multiagent settings. In Proc. IJCAI, pages 705-711, 2003.

[12] B. Ng, C. Meyers, K. Boakye, and J. Nitao. Towards applying interactive POMDPs to
real-world adversary modeling. In Proc. IAAI, pages 1814-1820, 2010.

[13] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for
POMDPs. In Proc. IJCAI, pages 1025-1032, 2003.

[14] B. Rathnasabapathy, P. Doshi, and P. Gmytrasiewicz. Exact solutions of interactive POMDPs
using behavioral equivalence. In Proc. AAMAS, pages 1025-1032, 2006.

[15] S. Seuken and S. Zilberstein. Memory-bounded dynamic programming for DEC-POMDPs. In
Proc. IJCAI, pages 2009-2015, 2007.

[16] M. T. J. Spaan, F. A. Oliehoek, and C. Amato. Scaling up optimal heuristic search in
DEC-POMDPs via incremental expansion. In Proc. IJCAI, pages 2027-2032, 2011.






A Proof Sketch of Theorem 1 



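Theorem 1. Let $Q^{n}_{-t}(s, v) \triangleq |U|^{-1} \sum_{u \in U} Q^{0,n}_{-t}(s, v, u)$ and $Q^{n}_{-t}(b, v)$ denote n-step-to-go values
of selecting action $v \in V$ in state $s \in S$ and belief $b$, respectively, for the other agent -t using nested
MDP and I-POMDP at reasoning level 0 (i.e., MDP and POMDP). If $b(s) \geq 1 - \epsilon$ and

\[
\forall (s, v, u)\ \exists (s', o)\quad \Pr(s' \mid s, v, u) \geq 1 - \frac{\epsilon}{2} \ \wedge\ \Pr(o \mid s, v) \geq 1 - \frac{\epsilon}{2}
\]

for some $\epsilon \geq 0$, then

\[
\left| Q^{n}_{-t}(s, v) - Q^{n}_{-t}(b, v) \right| \leq \epsilon\, O\!\left( \frac{R_{\max} - R_{\min}}{1 - \phi} \right) \tag{8}
\]

where $R_{\max}$ and $R_{\min}$ denote agent -t's maximum and minimum immediate payoffs, respectively.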

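Proof: Assuming that (8) holds with n (it is trivial to verify that it holds with n = 0), we need to
prove that it also holds with n + 1. Define $L^{n+1}_{-t}(s, v)$ as the optimal expected utility of the other
agent if it knows that the first state is s and executes v in the first step. From the second step, it only
observes the environmental state partially (i.e., same as the POMDP agent). Then,

\[
\left| Q^{n+1}_{-t}(s, v) - Q^{n+1}_{-t}(b, v) \right| \leq \left| Q^{n+1}_{-t}(s, v) - L^{n+1}_{-t}(s, v) \right| + \left| L^{n+1}_{-t}(s, v) - Q^{n+1}_{-t}(b, v) \right| .
\]

I. Note that $Q^{n+1}_{-t}(b, v) = \sum_{s'} \alpha^{*}_{v}(s')\, b(s')$ due to piecewise linearity and convexity and
$L^{n+1}_{-t}(s, v) = \alpha^{*}_{v}(s)$, by definition. Assuming that $L^{n+1}_{-t}(s, v) \geq Q^{n+1}_{-t}(b, v)$, the second term
in the right-hand side of the above inequality is linearly bounded w.r.t. $\epsilon$, as shown below:

\[
\begin{aligned}
L^{n+1}_{-t}(s, v) - Q^{n+1}_{-t}(b, v) &\leq \left( \alpha^{*}_{v}(s) - \min_{s'} \alpha^{*}_{v}(s') \right) \left( 1 - b(s) \right) \\
&\leq \epsilon \left( \alpha^{*}_{v}(s) - \min_{s'} \alpha^{*}_{v}(s') \right) .
\end{aligned}
\]

Similarly, if $L^{n+1}_{-t}(s, v) < Q^{n+1}_{-t}(b, v)$, we can instead show that $Q^{n+1}_{-t}(b, v) - L^{n+1}_{-t}(s, v) \leq
\epsilon \left( \max_{s'} \alpha^{*}_{v}(s') - \alpha^{*}_{v}(s) \right)$. Thus, it follows that

\[
\left| L^{n+1}_{-t}(s, v) - Q^{n+1}_{-t}(b, v) \right| \leq \epsilon \left( \max_{s'} \alpha^{*}_{v}(s') - \min_{s'} \alpha^{*}_{v}(s') \right) \leq \epsilon\, O\!\left( \frac{R_{\max} - R_{\min}}{1 - \phi} \right) . \tag{9}
\]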
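The bound in part I can be checked numerically. The sketch below is only a sanity check under stated assumptions: the alpha-vector is drawn at random rather than computed from a POMDP, and the helper names `part_one_gap` and `peaked_belief` are hypothetical. It verifies that the gap between knowing the state s and holding a belief with b(s) >= 1 - eps never exceeds eps times the spread of the alpha-vector.

```python
import random

def part_one_gap(alpha, belief, s):
    """|alpha(s) - sum_{s'} alpha(s') b(s')|: value gap between state s and belief b."""
    expected = sum(a * b for a, b in zip(alpha, belief))
    return abs(alpha[s] - expected)

def peaked_belief(num_states, s, eps, rng):
    """Random belief with b(s) >= 1 - eps; the leftover mass is spread over other states."""
    leftover = rng.uniform(0.0, eps)
    weights = [rng.random() + 1e-9 if i != s else 0.0 for i in range(num_states)]
    total = sum(weights)
    belief = [leftover * w / total for w in weights]
    belief[s] = 1.0 - leftover
    return belief

rng = random.Random(1)
eps = 0.1
worst_slack = float("inf")
for _ in range(1000):
    alpha = [rng.uniform(-5.0, 5.0) for _ in range(6)]  # hypothetical alpha-vector
    s = rng.randrange(6)
    b = peaked_belief(6, s, eps, rng)
    bound = eps * (max(alpha) - min(alpha))
    worst_slack = min(worst_slack, bound - part_one_gap(alpha, b, s))

print(worst_slack >= -1e-12)
```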



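II. From our definitions of $Q^{n+1}_{-t}(s, v)$ and $L^{n+1}_{-t}(s, v)$, it directly follows that

\[
Q^{n+1}_{-t}(s, v) = \frac{1}{|U|} \sum_{u} \left( R_{-t}(s, v, u) + \phi \sum_{s'} T^{v,u}_{s}(s')\, V^{n}_{-t}(s') \right)
\]

where $V^{n}_{-t}(s') = \max_{v'} Q^{n}_{-t}(s', v')$, $R_{-t}(s, v, u)$ is the other agent's immediate payoff, and
$T^{v,u}_{s}(s') = \Pr(s' \mid s, v, u)$. Similarly, $L^{n+1}_{-t}(s, v)$ can be expressed as

\[
L^{n+1}_{-t}(s, v) = \frac{1}{|U|} \sum_{u} \left( R_{-t}(s, v, u) + \phi \sum_{s', o} T^{v,u}_{s}(s')\, Z^{v}_{s'}(o)\, V^{n}_{-t}(b^{v,o}_{s}) \right)
\]

where $V^{n}_{-t}(b^{v,o}_{s}) = \max_{v'} Q^{n}_{-t}(b^{v,o}_{s}, v')$, $Z^{v}_{s'}(o) = \Pr(o \mid s', v)$, and

\[
b^{v,o}_{s}(s') = \frac{Z^{v}_{s'}(o)\, T^{v,u}_{s}(s')}{\sum_{s''} Z^{v}_{s''}(o)\, T^{v,u}_{s}(s'')} .
\]

(By definition of $L^{n+1}_{-t}(s, v)$, the previous state s is known exactly. So, we do not need to take the average
w.r.t. $b(s)$.) Then,

\[
\left| Q^{n+1}_{-t}(s, v) - L^{n+1}_{-t}(s, v) \right| \leq \frac{\phi}{|U|} \sum_{u} F(u) \tag{10}
\]

where $F(u) \triangleq \left| \sum_{s'} T^{v,u}_{s}(s') \left( V^{n}_{-t}(s') - \sum_{o} Z^{v}_{s'}(o)\, V^{n}_{-t}(b^{v,o}_{s}) \right) \right|$.
Let $s^{*}$ be the state at which $T^{v,u}_{s}(s^{*}) \geq 1 - \frac{\epsilon}{2}$; we have

\[
\begin{aligned}
F(u) &\leq \sum_{s' \neq s^{*}} T^{v,u}_{s}(s') \left| V^{n}_{-t}(s') - \sum_{o} Z^{v}_{s'}(o)\, V^{n}_{-t}(b^{v,o}_{s}) \right|
 + T^{v,u}_{s}(s^{*}) \left| V^{n}_{-t}(s^{*}) - \sum_{o} Z^{v}_{s^{*}}(o)\, V^{n}_{-t}(b^{v,o}_{s}) \right| && (11) \\
&\leq O\!\left( \frac{R_{\max} - R_{\min}}{1 - \phi} \right) \sum_{s' \neq s^{*}} T^{v,u}_{s}(s')
 + T^{v,u}_{s}(s^{*}) \left| V^{n}_{-t}(s^{*}) - \sum_{o} Z^{v}_{s^{*}}(o)\, V^{n}_{-t}(b^{v,o}_{s}) \right| . && (12)
\end{aligned}
\]

Since $T^{v,u}_{s}(s^{*}) \geq 1 - \frac{\epsilon}{2}$ and $\sum_{s' \neq s^{*}} T^{v,u}_{s}(s') = 1 - T^{v,u}_{s}(s^{*})$, it follows directly that the first term
in the right-hand side of (12) is bounded by $\frac{\epsilon}{2}\, O((R_{\max} - R_{\min})/(1 - \phi))$. Then, let
$G(s^{*}, u) = T^{v,u}_{s}(s^{*}) \left| V^{n}_{-t}(s^{*}) - \sum_{o} Z^{v}_{s^{*}}(o)\, V^{n}_{-t}(b^{v,o}_{s}) \right|$ and let $o^{*}$ be the value at which
$Z^{v}_{s^{*}}(o^{*}) \geq 1 - \frac{\epsilon}{2}$. We have

\[
G(s^{*}, u) \leq \left| V^{n}_{-t}(s^{*}) - V^{n}_{-t}(b^{v,o^{*}}_{s}) \right| + \left| V^{n}_{-t}(b^{v,o^{*}}_{s}) - \sum_{o} Z^{v}_{s^{*}}(o)\, V^{n}_{-t}(b^{v,o}_{s}) \right| . \tag{13}
\]

Since the first term on the right-hand side of (13) is trivially bounded by
$\max_{v'} \left| Q^{n}_{-t}(s^{*}, v') - Q^{n}_{-t}(b^{v,o^{*}}_{s}, v') \right|$ and the posterior belief
$b^{v,o^{*}}_{s}(s^{*}) \geq T^{v,u}_{s}(s^{*})\, Z^{v}_{s^{*}}(o^{*}) \geq \left( 1 - \frac{\epsilon}{2} \right)^{2} \geq 1 - \epsilon$,
we can bound it by $\epsilon\, O((R_{\max} - R_{\min})/(1 - \phi))$ using our inductive assumption.
Also, using the same argument in part I, the second term is bounded by
$\frac{\epsilon}{2}\, O((R_{\max} - R_{\min})/(1 - \phi))$.
This implies $G(s^{*}, u)$ and hence $F(u)$ are bounded by $\epsilon\, O((R_{\max} - R_{\min})/(1 - \phi))$. Plugging $F(u)$ into
(10), we can prove that $\left| Q^{n+1}_{-t}(s, v) - L^{n+1}_{-t}(s, v) \right| \leq \epsilon\, O((R_{\max} - R_{\min})/(1 - \phi))$.

III. Putting the results in parts I and II together, it follows that (8) holds for n + 1 as well.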

B Empirical Evaluations for Nested MDP 

This section evaluates the performance of nested MDP in a two-player, zero-sum Markov game
modeled after soccer. In particular, the soccer game introduced in [Littman, 1994] is played on a
4 x 5 grid, as shown in Fig. 4. At any moment, the two players A and B occupy different squares and
have five actions: top, down, left, right, and stand. Only one of them has the ball. Once they have
selected their actions, the moves are executed in random order. When the player with the ball moves
into its assigned goal (left and right shaded squares for A and B, respectively), it gets one point
and its opponent loses a point and vice versa. The game is then reset to the initial configuration
shown in Fig. 4 with the ball possession going to either one at random. When a player steps into
the square occupied by its opponent, the ball possession goes to its opponent and the move does not
take place. The discount factor is set to 0.9, which makes scoring sooner better than scoring later.

Figure 4: Soccer game.

The performance of the nested MDP player at level k = 1 is compared against that of the
state-of-the-art Nash-Q [Hu and Wellman, 1998] and the MDP planning players. These players are tasked
to compete against opponents employing Nash-Q and MDP planning, a random policy selecting
actions uniformly, and a hand-built policy specifying simple scoring and blocking rules:

Scoring. This set of tactics applies when the agent possesses the ball. At any step, if the agent is 
not on the right track (i.e., its current row goes straight to a goal), it will consider switching to the 
closest track while avoiding its opponent. Otherwise, it will try to reach the goal as fast as possible 
without losing the ball to the defending opponent. 

Specifically, when the agent is on the right track, it will just keep moving towards the goal if the 
opponent is not blocking. Otherwise, it will only move forward with a certain probability which is 
set to be proportional to its distance from the opponent. If the agent decides not to move forward, it 
will consider staying still with a small probability to maintain the ball. If not, it either moves left or 
right with equal probability to keep away from the opponent. 

When the agent is not on the right track, it simply moves to the closest track if the defending oppo- 
nent is left behind or too far ahead (i.e., more than two steps ahead). Otherwise, if the opponent is 
close up ahead, it will consider either moving forward (with the hope of getting past its opponent) 
or switching to the right track with certain probabilities. Again, we make the agent's probability of 
moving forward proportional to its distance from the opponent. 

Blocking. When the agent loses the ball, the following tactics apply: The agent will chase after the 
opponent (who is rushing towards its corresponding goal) if it is left behind. Otherwise, it either 
switches to the same track (i.e., row) as the opponent or stays still if their tracks are currently the 
same. That is, the agent always tries to stand in the way of the opponent, which appears to be a very 
effective defending strategy [Littman, 1994]. 
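The blocking tactic can be sketched as a small decision rule. This is a simplified reading of the description above, not the authors' hand-built policy: positions are assumed to be (row, column) pairs with row 0 at the top, "left behind" is assumed to mean that the opponent is already past the agent's column on the way to its goal, and chasing is reduced to a single step in the goal direction.

```python
def blocking_action(agent, opponent, opponent_goal_dir):
    """Choose a defensive move for an agent that does not hold the ball.

    agent, opponent: (row, col) grid positions (row 0 is the top row, an assumption).
    opponent_goal_dir: -1 if the opponent attacks the left goal, +1 for the right goal.
    Returns one of the five action names used in the text.
    """
    a_row, a_col = agent
    o_row, o_col = opponent
    # Assumed meaning of "left behind": the opponent is already past the agent.
    left_behind = (o_col - a_col) * opponent_goal_dir > 0
    if left_behind:
        # Chase the opponent towards the goal it is rushing to.
        return "left" if opponent_goal_dir < 0 else "right"
    if a_row == o_row:
        # Already on the opponent's track: stay in its way.
        return "stand"
    # Otherwise switch towards the opponent's track (row).
    return "top" if o_row < a_row else "down"

print(blocking_action((2, 1), (0, 3), -1))
```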

Each competition between a player and its opponent consists of a number of games, each of which 
ends immediately after either the first goal or being declared as a draw with 0.1 probability at every 
step. But, each competition only ends after 10000 games that do not end as a draw. 
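The competition protocol can be sketched as follows. This is a minimal sketch under assumptions: `toy_step` is a hypothetical placeholder for one step of the soccer game, the draw test is assumed to happen after the scoring test within a step, and moves in drawn games are assumed to count towards the total.

```python
import random

def play_game(step_fn, draw_prob, rng):
    """Play one game. Returns (scorer, steps), where scorer is 'A', 'B', or None for a draw."""
    steps = 0
    while True:
        steps += 1
        scorer = step_fn(rng)           # 'A', 'B', or None if nobody scored this step
        if scorer is not None:
            return scorer, steps
        if rng.random() < draw_prob:    # the game is declared a draw at this step
            return None, steps

def run_competition(step_fn, num_decisive_games=10000, draw_prob=0.1, seed=0):
    """Repeat games until num_decisive_games of them have ended with a goal."""
    rng = random.Random(seed)
    goals = {"A": 0, "B": 0}
    moves = 0
    draws = 0
    while goals["A"] + goals["B"] < num_decisive_games:
        scorer, steps = play_game(step_fn, draw_prob, rng)
        moves += steps
        if scorer is None:
            draws += 1
        else:
            goals[scorer] += 1
    return goals, moves, draws

def toy_step(rng):
    # Hypothetical dynamics: each player scores with probability 0.05 per step.
    x = rng.random()
    if x < 0.05:
        return "A"
    if x < 0.10:
        return "B"
    return None

goals, moves, draws = run_competition(toy_step, num_decisive_games=1000)
print(goals, moves, draws)
```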

Table 3 shows the results of the performance of the tested players. The observations are as follows:

(a) Against a Nash-Q opponent, the nested MDP player's performance is comparable to that of
the Nash-Q player. In theory, the best strategy against a Nash-Q opponent should be a corresponding
one in the same Nash equilibrium. This implies the nested MDP player achieves a near-optimal
policy against the Nash-Q opponent. Intuitively, although the nested MDP player does not expect
the opponent to play the Nash strategy, it can roughly predict which of its opponent's moves would
potentially lead to its loss and how likely these moves will be executed. Exploiting this knowledge,
it plans a policy that rationally trades off between defense and offense, thereby achieving
near-optimal performance against the Nash-Q opponent.

(b) Against an MDP opponent, the nested MDP and Nash-Q players' performance is superior to that of
the MDP player, who wrongly assumes random action selection by the opponent, thereby leading
to an offensive policy with weak defense. Notably, while the nested MDP and Nash-Q players'
performance is comparable in this case, the nested MDP player completed the competition significantly
faster (i.e., fewer moves taken) than the Nash-Q player.

(c) Against a random-policy opponent, the Nash-Q and MDP players' performance is nearly identical
while the nested MDP player performs significantly better. Upon close examination, we observe
that, in such a two-player zero-sum game, the Nash-Q strategy coincides with the maximin strategy.
So, its corresponding policy is too risk-averse and fails to fully exploit the weaknesses of the
random-policy opponent, thereby losing some chances of winning. In contrast, the MDP player performs
well because its offensive strategy is generally effective against the weak random-policy opponent.
Lastly, since the nested MDP player always has a better balance between defense and offense, its
performance is the best among the players. These behaviors can be observed from the number of
moves the players took to finish the competition against the same opponent.

Table 3: Comparison of number of goals scored in a competition by different players; the number
of moves taken to complete 10000 goals is specified in brackets. Rows are opponents and columns
are players.

Opponent \ Player    Nash-Q           Nested MDP       MDP
Nash-Q               4970 (173051)    4955 (140867)    2393 (74545)
MDP                  7602 (74458)     7558 (53946)     4983 (37277)
Random               9771 (128388)    9936 (96865)     9770 (77992)
Hand-built           5575 (130843)    6069 (99698)     2880 (56854)

(d) Against a hand-built policy opponent, the Nash-Q and nested MDP players win more than half the
time while the MDP player performs badly due to its ignorance of the opponent's intention, as shown
in previous experiments. In particular, the nested MDP player significantly outperforms the
Nash-Q player: the nested MDP player, while planning, predicts the opponent's intention by recursively
reasoning in its place, thereby recognizing its opponent's critical moves that can potentially lead to
its loss. Exploiting this knowledge, the nested MDP player usually has a better trade-off between
offense and defense than the Nash-Q player who only focuses on the worst-case situations, thereby
losing chances to score if its opponent does not make moves that lead to such situations.

Essentially, these observations show that the nested MDP player (1) significantly outperforms the
MDP player (against all opponents) and the Nash-Q player (against Random and Hand-built opponents)
and (2) performs comparably to the Nash-Q player (against Nash-Q and MDP opponents). This suggests
that nested MDP is a better planning framework under fully observable settings than
previous ones such as Nash-Q and MDP planning.
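The comparisons above can be read off Table 3 as goal shares and average moves per goal. The snippet below is only arithmetic on the numbers reported in Table 3; the dictionary layout and function names are illustrative choices.

```python
# Table 3 data: TABLE_3[opponent][player] = (goals scored by player, total moves).
TABLE_3 = {
    "Nash-Q":     {"Nash-Q": (4970, 173051), "Nested MDP": (4955, 140867), "MDP": (2393, 74545)},
    "MDP":        {"Nash-Q": (7602, 74458),  "Nested MDP": (7558, 53946),  "MDP": (4983, 37277)},
    "Random":     {"Nash-Q": (9771, 128388), "Nested MDP": (9936, 96865),  "MDP": (9770, 77992)},
    "Hand-built": {"Nash-Q": (5575, 130843), "Nested MDP": (6069, 99698),  "MDP": (2880, 56854)},
}
TOTAL_GOALS = 10000  # each competition ends after 10000 games that are not draws

def goal_share(opponent, player):
    """Fraction of the 10000 goals scored by the player against this opponent."""
    goals, _ = TABLE_3[opponent][player]
    return goals / TOTAL_GOALS

def moves_per_goal(opponent, player):
    """Average number of moves the competition needed per goal."""
    _, moves = TABLE_3[opponent][player]
    return moves / TOTAL_GOALS

for opponent in TABLE_3:
    for player in TABLE_3[opponent]:
        print(f"{player:>10} vs {opponent:<10} "
              f"share={goal_share(opponent, player):.4f} "
              f"moves/goal={moves_per_goal(opponent, player):.2f}")
```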



C Absolute Continuity Condition of Interactive Beliefs for I-PBVI 

In this section, we provide a more detailed discussion on the Absolute Continuity Condition (ACC)
of interactive beliefs [Doshi and Perez, 2008], which is crucial to the mathematical soundness and
hence the feasibility of using I-PBVI. Indeed, we have ensured that this condition is met in all of our
experiments so that I-PBVI is not put at a disadvantage against I-POMDP Lite. Intuitively, I-PBVI
aims to make the planning of I-POMDP tractable by constraining the infinite interactive state space
IS = S × Δ(S) to a finite space IS' = S × Reach(B, h) bounded with respect to the length h of
planning horizon. Since Reach(B, h) includes all beliefs that can be reached from B by following a
certain action-observation history of length h, if the true initial belief of the other agent is included
in B, then all its reachable beliefs within h steps of interaction are included in Reach(B, h) as well.
This essentially means that IS' includes all interactive states of IS that can be assigned a non-zero
probability within h steps of interaction and hence guarantees the correctness of the interactive belief
update on IS'.

On the other hand, if B does not include the other agent's true initial belief, it is possible that
the other agent actually reaches an interactive state not included in Reach(B, h) within h steps
of interaction. Consequently, this causes the loss of probability mass while updating the interactive
belief: if the action-observation history assigns a non-zero probability mass to an interactive state not
included in IS', then this mass will be lost because the interactive belief update step only considers
those included in IS'. As the interaction proceeds, the probability mass will disappear gradually
until the whole probability mass, which should sum to 1, is completely lost (i.e., the actual
action-observation history assigns zero probability to all interactive states included in IS'). At this point,
the agent will become indifferent between the actions and thus perform terribly. We have also
verified this by excluding the other agent's true initial belief from B: after a few steps,
our agent's interactive belief disappears and its performance starts to degrade (i.e., getting negative
rewards).

So, the ACC simply requires that the other agent's true initial belief should be included in
B. However, the specification of B is often complicated: if B is made too large just to increase the
chance of including the other agent's true initial belief, the planning process may become
computationally intractable since |IS'| = |S| × |Reach(B, h)| = O(|S||B||U|^h|V|^h|O|^h). Otherwise, there
is a need for highly accurate and informative prior knowledge that can satisfy the ACC
with a smaller-sized B. While such detailed prior knowledge of the other agent is assumed in all
our experiments as well as the benchmark problems reported in [Doshi and Perez, 2008], one should
be cautioned that it cannot be easily accessed nor processed in real-world situations (e.g., involving
humans). This is another obstacle that limits I-PBVI's practical use.
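The growth of |IS'| with the horizon can be illustrated with a worst-case count. This is a rough sketch under assumptions: every history is assumed to lead to a distinct belief, the per-step branching factor is taken to be |U||V||O| to match the bound above, and |B| = 5 is a hypothetical value not taken from the experiments.

```python
def reach_size_upper_bound(num_initial_beliefs, branching, horizon):
    """Worst-case |Reach(B, h)|: beliefs reachable by histories of length 0..h,
    assuming every history leads to a distinct belief."""
    return num_initial_beliefs * sum(branching ** i for i in range(horizon + 1))

def interactive_state_space_size(num_states, num_initial_beliefs, branching, horizon):
    """Worst-case |IS'| = |S| * |Reach(B, h)|."""
    return num_states * reach_size_upper_bound(num_initial_beliefs, branching, horizon)

# Hypothetical setting: |S| = 10, |B| = 5, |U| = |V| = 3, |O| = 8.
branching = 3 * 3 * 8
for h in range(1, 4):
    print(h, interactive_state_space_size(10, 5, branching, h))
```

Each additional look-ahead step multiplies the dominant term by the branching factor, which is the exponential blow-up in h discussed above.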



D Proofs 

We hereby present our formal proofs for the theorems stated in Section 4. To increase the readability
of this section, we structure our proof in three separate sections:

• Definitions (Section D.1) - we introduce the conventional definitions used in this analysis, some
of which are previously stated in Section 4.

• Intermediate Results (Section D.2) - we state and prove several intermediate results, which are
necessary to prove our main theorems.

• Main Theorems (Section D.3) - finally, we formally derive the main results stated in Section 4.



D.1 Definitions

Definition 1. Let $R_{\max} = \max_{s,u,v} R(s, u, v)$ be the maximum value of our agent's payoffs and let
$\pi^{*}_{-t}$ be the true mixed strategy of the other agent -t such that $\pi^{*}_{-t}(s, v)$ denotes the true probability
$Pr^{*}(v|s)$ of agent -t selecting action $v \in V$ in state $s \in S$. Then, the prediction error is

\[ \epsilon_p = \max_{v,s} \left| Pr^{*}(v|s) - Pr(v|s) \right| . \]
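As a concrete reading of Definition 1, the prediction error is the largest entry-wise gap between the true and predicted strategy tables. The sketch below assumes strategies are stored as lists indexed by state and then by action; the numbers are made up for illustration.

```python
def prediction_error(true_strategy, predicted_strategy):
    """epsilon_p: max over states s and actions v of |Pr*(v|s) - Pr(v|s)|.

    Both arguments are lists indexed as strategy[s][v].
    """
    return max(
        abs(p_true - p_pred)
        for row_true, row_pred in zip(true_strategy, predicted_strategy)
        for p_true, p_pred in zip(row_true, row_pred)
    )

# Hypothetical two-state, three-action example.
true_strategy = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
predicted_strategy = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
print(round(prediction_error(true_strategy, predicted_strategy), 6))
```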

Definition 2. Given the true mixed strategy $\pi^{*}_{-t}(s, v) = Pr^{*}(v|s)$ of the agent -t, let us denote
$Pr^{*}(v, o|b, u)$, $B^{*}(b, u, v, o)$ and $V^{*}_{n}(b)$ as the belief-state observation model, belief-state update
function and optimal value function over time horizon n computed with respect to $Pr^{*}(v|s)$.

Definition 3. Let $V_{n}(b)$ denote the optimal value function computed with respect to the
prediction $Pr(v|s)$. The difference between $V_{n}(b)$ and the true optimal value function $V^{*}_{n}(b)$ is
$\delta_{n} \triangleq \max_{b} |V^{*}_{n}(b) - V_{n}(b)|$.

Definition 4. Let $Q^{*}_{n}(b, u)$ and $Q_{n}(b, u)$ denote the corresponding Q-functions of $V^{*}_{n}(b)$ and $V_{n}(b)$:

\[
\begin{aligned}
V^{*}_{n}(b) &= \max_{u} Q^{*}_{n}(b, u) \\
V_{n}(b) &= \max_{u} Q_{n}(b, u) .
\end{aligned}
\]

Consequently, we define the optimal policies $\pi^{*}$ and $\pi$ induced from $V^{*}_{n}$ and $V_{n}$, respectively, as:

\[
\begin{aligned}
\pi^{*}(b) &= \arg\max_{u} Q^{*}_{n}(b, u) \\
\pi(b) &= \arg\max_{u} Q_{n}(b, u) .
\end{aligned}
\]

Definition 5. Let $J^{*}_{n}(b)$ and $J_{n}(b)$ denote the expected total rewards if, starting from belief b, we
follow the optimal policy $\pi^{*}$ and the induced policy $\pi$ for n steps, respectively. Thus, we have:

\[
\begin{aligned}
J^{*}_{n}(b) &= \sum_{s,v} b(s)\, Pr^{*}(v|s)\, R(s, \pi^{*}(b), v)
 + \phi \sum_{v,o} Pr^{*}(v, o|b, \pi^{*}(b))\, J^{*}_{n-1}(B^{*}(b, \pi^{*}(b), v, o)) \\
J_{n}(b) &= \sum_{s,v} b(s)\, Pr^{*}(v|s)\, R(s, \pi(b), v)
 + \phi \sum_{v,o} Pr^{*}(v, o|b, \pi(b))\, J_{n-1}(B^{*}(b, \pi(b), v, o)) .
\end{aligned}
\]

Consequently, it can be verified by the definition of the value function $V^{*}_{n}$ that $J^{*}_{n} = V^{*}_{n}$.

D.2 Intermediate Results

Lemma 1. Given a prediction $\pi_{-t}(s, v) = Pr(v|s)$ of the other agent -t's mixed strategy, the
difference between the optimal value function $V_{n}$ and the true optimal value function $V^{*}_{n}$ of our
agent t, with respect to agent -t's true mixed strategy $\pi^{*}_{-t}(s, v) = Pr^{*}(v|s)$, is bounded by:

\[ \delta_{n} \leq \max_{b} \max_{u} \left| Q^{*}_{n}(b, u) - Q_{n}(b, u) \right| . \tag{14} \]

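Proof: Let us denote $u' = \arg\max_{u} Q_{n}(b, u)$ and $u^{*} = \arg\max_{u} Q^{*}_{n}(b, u)$. Assuming that $V^{*}_{n}(b) \leq
V_{n}(b)$, we have

\[
\begin{aligned}
|V^{*}_{n}(b) - V_{n}(b)| &= V_{n}(b) - V^{*}_{n}(b) = Q_{n}(b, u') - \max_{u} Q^{*}_{n}(b, u) \\
&\leq Q_{n}(b, u') - Q^{*}_{n}(b, u') \leq |Q_{n}(b, u') - Q^{*}_{n}(b, u')| . && (15)
\end{aligned}
\]

Similarly, if $V_{n}(b) < V^{*}_{n}(b)$, we also have

\[ |V^{*}_{n}(b) - V_{n}(b)| \leq |Q^{*}_{n}(b, u^{*}) - Q_{n}(b, u^{*})| . \tag{16} \]

Therefore, from inequalities (15) and (16), we have

\[
\begin{aligned}
|V^{*}_{n}(b) - V_{n}(b)| &\leq \max\left( |Q_{n}(b, u') - Q^{*}_{n}(b, u')|,\ |Q^{*}_{n}(b, u^{*}) - Q_{n}(b, u^{*})| \right) \\
&\leq \max_{u} |Q^{*}_{n}(b, u) - Q_{n}(b, u)| . && (17)
\end{aligned}
\]

From (17) and Definition 3, we have $\delta_{n} \leq \max_{b} \max_{u} |Q^{*}_{n}(b, u) - Q_{n}(b, u)|$. □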
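The inequality behind Lemma 1 is the elementary fact that the gap between two maxima is at most the largest pointwise gap. A quick randomized check, with Q-values drawn at random purely for illustration (they are not computed from any model):

```python
import random

def value_gap(q_true, q_pred):
    """|max_u Q*(b,u) - max_u Q(b,u)| for one belief b."""
    return abs(max(q_true) - max(q_pred))

def max_q_gap(q_true, q_pred):
    """max_u |Q*(b,u) - Q(b,u)| for one belief b."""
    return max(abs(a - b) for a, b in zip(q_true, q_pred))

rng = random.Random(7)
holds = True
for _ in range(10000):
    q_true = [rng.uniform(-10.0, 10.0) for _ in range(3)]   # hypothetical Q*(b, .)
    q_pred = [q + rng.uniform(-1.0, 1.0) for q in q_true]   # hypothetical Q(b, .)
    holds = holds and value_gap(q_true, q_pred) <= max_q_gap(q_true, q_pred) + 1e-12

print(holds)
```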

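Lemma 2. Suppose that at current belief b, our agent (i.e., agent t) executes action u. We define
the expected immediate payoffs with respect to our predictive distribution $\pi_{-t}(s, v) = Pr(v|s)$ and
the true mixed strategy $\pi^{*}_{-t}(s, v) = Pr^{*}(v|s)$ of agent -t as

\[
\begin{aligned}
R^{*}(b, u) &= \sum_{s} b(s) \sum_{v} R(s, u, v)\, Pr^{*}(v|s) \\
R(b, u) &= \sum_{s} b(s) \sum_{v} R(s, u, v)\, Pr(v|s) .
\end{aligned}
\]

The difference between $R^{*}(b, u)$ and $R(b, u)$ is linearly bounded by the prediction error $\epsilon_p$:

\[ |R^{*}(b, u) - R(b, u)| \leq \epsilon_p |V| R_{\max} . \]

Proof: We have

\[
\begin{aligned}
|R^{*}(b, u) - R(b, u)| &\leq \sum_{s} b(s) \sum_{v} R(s, u, v)\, |Pr^{*}(v|s) - Pr(v|s)| \\
&\leq \epsilon_p R_{\max} \sum_{s,v} b(s) = \epsilon_p |V| R_{\max} \sum_{s} b(s) \\
&\leq |V| R_{\max} \epsilon_p . && (18)
\end{aligned}
\]

□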
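Lemma 2 can be sanity-checked numerically. The sketch below assumes non-negative payoffs stored as R[s][u][v] and strategies stored as strategy[s][v]; all numbers are random and hypothetical, and the helper names are illustrative.

```python
import random

def expected_payoff(b, u, R, strategy):
    """R(b,u) = sum_s b(s) sum_v R[s][u][v] * strategy[s][v]."""
    return sum(
        b[s] * sum(R[s][u][v] * strategy[s][v] for v in range(len(strategy[s])))
        for s in range(len(b))
    )

def random_distribution(n, rng):
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(11)
num_s, num_u, num_v = 4, 2, 3
lemma_holds = True
for _ in range(2000):
    # Hypothetical non-negative payoffs and random true/predicted strategies.
    R = [[[rng.uniform(0.0, 5.0) for _ in range(num_v)] for _ in range(num_u)]
         for _ in range(num_s)]
    r_max = max(x for per_s in R for per_u in per_s for x in per_u)
    true_strategy = [random_distribution(num_v, rng) for _ in range(num_s)]
    pred_strategy = [random_distribution(num_v, rng) for _ in range(num_s)]
    eps_p = max(abs(p - q) for rt, rp in zip(true_strategy, pred_strategy)
                for p, q in zip(rt, rp))
    b = random_distribution(num_s, rng)
    u = rng.randrange(num_u)
    gap = abs(expected_payoff(b, u, R, true_strategy) - expected_payoff(b, u, R, pred_strategy))
    lemma_holds = lemma_holds and gap <= eps_p * num_v * r_max + 1e-12

print(lemma_holds)
```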

Lemma 3. The difference between the belief-state observation models $Pr^{*}(v, o|b, u)$ and
$Pr(v, o|b, u)$ with respect to agent -t's true mixed strategy $\pi^{*}_{-t}(s, v) = Pr^{*}(v|s)$ and our
predictive distribution $\pi_{-t}(s, v) = Pr(v|s)$ is bounded by the prediction error $\epsilon_p$:

\[ |Pr^{*}(v, o|b, u) - Pr(v, o|b, u)| \leq \epsilon_p . \]

Proof: We have

\[
\begin{aligned}
|Pr^{*}(v, o|b, u) - Pr(v, o|b, u)| &\leq \sum_{s'} Z(s', u, o) \sum_{s} T(s, u, v, s')\, b(s)\, |Pr^{*}(v|s) - Pr(v|s)| \\
&\leq \epsilon_p \sum_{s} b(s) \sum_{s'} Z(s', u, o)\, T(s, u, v, s') \\
&\leq \epsilon_p \sum_{s} b(s) \sum_{s'} T(s, u, v, s') = \epsilon_p \sum_{s} b(s) = \epsilon_p . && (19)
\end{aligned}
\]

□

Lemma 4. For any two beliefs b and b', if $\|b - b'\|_{1} \leq \delta$, we have $|V(b) - V(b')| \leq \frac{R_{\max}}{1 - \phi}\, \delta$.
(Lipschitz condition)

Proof: For any two beliefs b and b', let $\alpha = \arg\max_{\alpha \in V} (\alpha \cdot b)$ and $\alpha' = \arg\max_{\alpha \in V} (\alpha \cdot b')$. Without
loss of generality, assuming that $V(b') \leq V(b)$, we have

\[
\begin{aligned}
|V(b) - V(b')| &= \alpha \cdot b - \alpha' \cdot b' \\
&\leq \alpha \cdot b - \alpha \cdot b' = \alpha \cdot (b - b') \\
&\leq \max_{s} \alpha(s) \sum_{s} |b(s) - b'(s)| \leq \max_{s} \alpha(s)\, \delta \\
&\leq \frac{R_{\max}}{1 - \phi}\, \delta . && (20)
\end{aligned}
\]

The last step follows from the fact that each component $\alpha(s)$ of an $\alpha$ vector is basically an expected
total reward (if the initial state is s and the agent follows the optimal policy) which is always
bounded by $\frac{R_{\max}}{1 - \phi}$. □

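Lemma 5. Let us define the unnormalized belief update function $F(b, u, v, o)$ of $B(b, u, v, o)$ as
follows:

\[
\begin{aligned}
F(b, u, v, o)(s') &= Z(s', u, o) \sum_{s} T(s, u, v, s')\, Pr(v|s)\, b(s) \\
&= B(b, u, v, o)(s')\, Pr(v, o|b, u) .
\end{aligned}
\]

The norm-1 distance between the unnormalized belief update functions $F^{*}(b, u, v, o)$ and
$F(b, u, v, o)$ of $B^{*}(b, u, v, o)$ and $B(b, u, v, o)$, respectively, is at most $\epsilon_p$.

Proof: Let us write $F(b, u, v, o)$ and $F^{*}(b, u, v, o)$ as $f'$ and $f^{*}$, respectively. We have

\[
\begin{aligned}
\|f^{*} - f'\|_{1} &\leq \sum_{s'} Z(s', u, o) \sum_{s} T(s, u, v, s')\, b(s)\, |Pr^{*}(v|s) - Pr(v|s)| \\
&\leq \epsilon_p \sum_{s'} Z(s', u, o) \sum_{s} T(s, u, v, s')\, b(s) \\
&= \epsilon_p \sum_{s} b(s) \sum_{s'} Z(s', u, o)\, T(s, u, v, s') \\
&\leq \epsilon_p \sum_{s} b(s) \sum_{s'} T(s, u, v, s') \\
&= \epsilon_p \sum_{s} b(s) = \epsilon_p . && (21)
\end{aligned}
\]

□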
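The unnormalized update F and its normalized counterpart B can be sketched directly from the formula in Lemma 5. The toy model below (2 states, 1 action for our agent, 2 actions for the other agent, 2 observations) is hypothetical; array layouts T[s][u][v][s'], Z[s'][u][o] and strategy[s][v] are assumptions made for the example.

```python
def unnormalized_update(b, u, v, o, T, Z, strategy):
    """F(b,u,v,o)(s') = Z[s'][u][o] * sum_s T[s][u][v][s'] * strategy[s][v] * b[s]."""
    n = len(b)
    return [
        Z[s_next][u][o] * sum(T[s][u][v][s_next] * strategy[s][v] * b[s] for s in range(n))
        for s_next in range(n)
    ]

def belief_update(b, u, v, o, T, Z, strategy):
    """Return (B(b,u,v,o), Pr(v,o|b,u)); the probability is the total mass of F."""
    f = unnormalized_update(b, u, v, o, T, Z, strategy)
    prob = sum(f)
    if prob == 0.0:
        raise ValueError("(v, o) has zero probability under this belief and strategy")
    return [x / prob for x in f], prob

# Hypothetical toy model.
T = [[[[0.9, 0.1], [0.2, 0.8]]], [[[0.3, 0.7], [0.5, 0.5]]]]
Z = [[[0.8, 0.2]], [[0.1, 0.9]]]
true_strategy = [[0.7, 0.3], [0.4, 0.6]]
pred_strategy = [[0.6, 0.4], [0.5, 0.5]]
b = [0.5, 0.5]

eps_p = max(abs(p - q) for rt, rp in zip(true_strategy, pred_strategy) for p, q in zip(rt, rp))
worst_l1 = max(
    sum(abs(x - y) for x, y in zip(unnormalized_update(b, 0, v, o, T, Z, true_strategy),
                                   unnormalized_update(b, 0, v, o, T, Z, pred_strategy)))
    for v in range(2) for o in range(2)
)
print(worst_l1 <= eps_p + 1e-12)
```

On this example, the largest norm-1 distance between the two unnormalized updates stays within the prediction error, as Lemma 5 states.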

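Proposition 1. For all values $b \in B$, $u \in U$, $v \in V$, $o \in O$, we have

\[ Pr^{*}(v, o|b, u)\, \|B^{*}(b, u, v, o) - B(b, u, v, o)\|_{1} \leq 2\epsilon_p . \]

Proof: Let $b^{*}$ and $b'$ denote $B^{*}(b, u, v, o)$ and $B(b, u, v, o)$, respectively. Also, let $f^{*}$ and $f'$ be the
unnormalized versions of $b^{*}$ and $b'$. Finally, let $A^{*}$ and $A$ denote the belief-state observation probabilities
$Pr^{*}(v, o|b, u)$ and $Pr(v, o|b, u)$. Thus, we have

\[
\begin{aligned}
A^{*} \|b^{*} - b'\|_{1} &= A^{*} \sum_{s'} \left| \frac{f^{*}(s')}{A^{*}} - \frac{f'(s')}{A} \right| \\
&= \sum_{s'} \left| f^{*}(s') - \frac{A^{*}}{A} f'(s') \right| \\
&\leq \sum_{s'} \left( |f^{*}(s') - f'(s')| + f'(s') \frac{|A - A^{*}|}{A} \right) \\
&\leq \sum_{s'} \left( |f^{*}(s') - f'(s')| + f'(s') \frac{\epsilon_p}{A} \right) && \text{(Lemma 3)} \\
&= \sum_{s'} |f^{*}(s') - f'(s')| + \frac{\epsilon_p}{A} \sum_{s'} f'(s') \\
&= \sum_{s'} |f^{*}(s') - f'(s')| + \frac{\epsilon_p}{A} \sum_{s'} A\, b'(s') && \text{(Def. of } f'(\cdot)\text{)} \\
&\leq \epsilon_p + \epsilon_p = 2\epsilon_p && \text{(Lemma 5)} \quad (22)
\end{aligned}
\]

□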



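Proposition 2. Given a predictive distribution $\pi_{-t}(s, v) = Pr(v|s)$ of agent -t's mixed strategy, we
have the following inequality:

\[ \left| V^{*}(B^{*}(b, u, v, o))\, Pr^{*}(v, o|b, u) - V^{*}(B(b, u, v, o))\, Pr(v, o|b, u) \right| \leq 3\epsilon_p \frac{R_{\max}}{1 - \phi} . \]

Proof: Let $b^{*}$ and $b'$ denote $B^{*}(b, u, v, o)$ and $B(b, u, v, o)$, respectively. Also, let $A^{*}$ and $A$ denote
the probabilities $Pr^{*}(v, o|b, u)$ and $Pr(v, o|b, u)$. Thus, we have:

\[
\begin{aligned}
|V^{*}(b^{*}) A^{*} - V^{*}(b') A| &\leq |V^{*}(b^{*}) A^{*} - V^{*}(b') A^{*}| + |V^{*}(b') A^{*} - V^{*}(b') A| \\
&= A^{*} |V^{*}(b^{*}) - V^{*}(b')| + V^{*}(b') |A^{*} - A| \\
&\leq A^{*} \frac{R_{\max}}{1 - \phi} \|b^{*} - b'\|_{1} + V^{*}(b') |A^{*} - A| && \text{(Lemma 4)} \\
&= \frac{R_{\max}}{1 - \phi} A^{*} \|b^{*} - b'\|_{1} + V^{*}(b') |A^{*} - A| \\
&\leq 2\epsilon_p \frac{R_{\max}}{1 - \phi} + V^{*}(b') |A^{*} - A| && \text{(Proposition 1)} \\
&\leq 2\epsilon_p \frac{R_{\max}}{1 - \phi} + V^{*}(b')\, \epsilon_p && \text{(Lemma 3)} \\
&\leq 2\epsilon_p \frac{R_{\max}}{1 - \phi} + \epsilon_p \frac{R_{\max}}{1 - \phi} = 3\epsilon_p \frac{R_{\max}}{1 - \phi} . && (23)
\end{aligned}
\]

The last step follows from the fact that the value function $V^{*}(b')$ is essentially the expected total
reward if the agent follows the optimal policy from the initial belief. Thus, it is trivially bounded by
$\frac{R_{\max}}{1 - \phi}$. □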

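Proposition 3. For all values $b \in B$, $u \in U$, we have:

$$|Q^*_n(b,u) - Q_n(b,u)| \le \phi\,\delta_{n-1} + \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) .$$

Proof: Let $b^*$ and $b'$ denote $B^*(b,u,v,o)$ and $B(b,u,v,o)$, respectively. Further, let us define

$$L_n(b,u) = R(b,u) + \phi \sum_{v}\sum_{o} V^*_{n-1}(b')\,\Pr(v,o|b,u) .$$

First, we prove that $|L_n(b,u) - Q_n(b,u)| \le \phi\,\delta_{n-1}$:

$$\begin{aligned}
|L_n(b,u) - Q_n(b,u)| &\le \phi \sum_{v,o} \Pr(v,o|b,u)\,|V^*_{n-1}(b') - V_{n-1}(b')| \\
&\le \phi \sum_{v,o} \Pr(v,o|b,u)\,\delta_{n-1} && \text{(by def. of } \delta_n\text{)} \\
&\le \phi\,\delta_{n-1} \sum_{v,o} \Pr(v,o|b,u) = \phi\,\delta_{n-1} .
\end{aligned} \tag{24}$$

Second, we prove that $|Q^*_n(b,u) - L_n(b,u)| \le \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right)$:

$$\begin{aligned}
|Q^*_n(b,u) - L_n(b,u)| &\le |R^*(b,u) - R(b,u)| + \phi \sum_{v,o} \left| V^*_{n-1}(b^*)\,\Pr^*(v,o|b,u) - V^*_{n-1}(b')\,\Pr(v,o|b,u) \right| \\
&\le \epsilon_p |V| R_{\max} + \phi \sum_{v,o} \left( 3\epsilon_p \frac{R_{\max}}{1-\phi} \right) && \text{(Lemma 2, Proposition 2)} \\
&= \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) .
\end{aligned} \tag{25}$$

Finally, we prove the main result of this proposition:

$$\begin{aligned}
|Q^*_n(b,u) - Q_n(b,u)| &\le |Q^*_n(b,u) - L_n(b,u)| + |L_n(b,u) - Q_n(b,u)| \\
&\le \phi\,\delta_{n-1} + \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) .
\end{aligned} \tag{26}$$

The last step follows from inequalities (24) and (25). $\square$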



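Proposition 4. Given our agent's prediction $\pi_{-t}(s,v) = \Pr(v|s)$ of its counterpart's mixed strategy (i.e., agent $-t$'s mixed strategy) with prediction error $\epsilon_p$, the incurred error $\delta_n$ of the corresponding optimal value function $V_n$, with respect to $\Pr(v|s)$, is linearly bounded by $\epsilon_p$:

$$\delta_n \le \epsilon_p |V| R_{\max} \left( \phi^{n-1} + \frac{1}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) \right) .$$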



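Proof: By Lemma 1, we have:

$$\delta_n \le \max_{b} \max_{u} |Q^*_n(b,u) - Q_n(b,u)| . \tag{27}$$

Also, Proposition 3 shows that:

$$|Q^*_n(b,u) - Q_n(b,u)| \le \phi\,\delta_{n-1} + \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) . \tag{28}$$

From (27) and (28), we have:

$$\begin{aligned}
\delta_n &\le \phi\,\delta_{n-1} + \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) \\
&\le \phi^{n-1}\delta_1 + \frac{\epsilon_p |V| R_{\max}}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) && \text{(expanding the recurrence)} \\
&\le \phi^{n-1}\epsilon_p |V| R_{\max} + \frac{\epsilon_p |V| R_{\max}}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) .
\end{aligned} \tag{29}$$

When $n \to \infty$, $\phi^{n-1}\epsilon_p |V| R_{\max} \to 0$. Thus, for large $n$ the incurred error is approximately bounded by $\frac{\epsilon_p |V| R_{\max}}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right)$. Consequently, when $\epsilon_p \to 0$, $\delta_n \to 0$. In general, this implies that the incurred error of the corresponding value function grows at most linearly with the prediction error. $\square$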
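The step labeled "expanding the recurrence" can be spelled out as a geometric sum. In the sketch below, $K$ is shorthand introduced only for this illustration, and the base case $\delta_1 \le \epsilon_p |V| R_{\max}$ is read off the last line of (29); presumably it follows from Lemma 2, since $V_1$ involves only the immediate reward.

```latex
\begin{align*}
K &:= \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right), \\
\delta_n &\le \phi\,\delta_{n-1} + K \\
         &\le \phi^{2}\delta_{n-2} + (1 + \phi)K \\
         &\;\;\vdots \\
         &\le \phi^{n-1}\delta_1 + K \sum_{j=0}^{n-2} \phi^{j}
          = \phi^{n-1}\delta_1 + K\,\frac{1-\phi^{n-1}}{1-\phi} \\
         &\le \phi^{n-1}\delta_1 + \frac{K}{1-\phi},
          \qquad \text{since } 0 \le \phi < 1 .
\end{align*}
```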
D.3 Main Theorems

Theorem 2. Let $V_\infty$ be the value function for infinite time horizon. Then, we have

$$\|V_\infty - V_{n+1}\|_\infty \le \phi\,\|V_\infty - V_n\|_\infty .$$

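Proof: We have

$$\begin{aligned}
|V_\infty(b) - V_{n+1}(b)| &\le \max_{u} |Q_\infty(b,u) - Q_{n+1}(b,u)| && (30) \\
&\le \max_{u}\, \phi \sum_{v,o} \Pr(v,o|b,u)\,|V_\infty(b') - V_n(b')| && (31) \\
&\le \max_{u}\, \phi \sum_{v,o} \Pr(v,o|b,u)\,\|V_\infty - V_n\|_\infty && (32) \\
&= \phi\,\|V_\infty - V_n\|_\infty \max_{u} \sum_{v,o} \Pr(v,o|b,u) && (33) \\
&= \phi\,\|V_\infty - V_n\|_\infty . && (34)
\end{aligned}$$

Since (34) holds for every belief $b$, taking the maximum over $b$ on the left-hand side completes the proof. $\square$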
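Applying Theorem 2 repeatedly gives $\|V_\infty - V_{n+k}\|_\infty \le \phi^{k}\|V_\infty - V_n\|_\infty$, so the number of further backups needed to shrink a given gap below a tolerance can be estimated directly. The helper below is a minimal sketch of that calculation; the function name and its arguments are hypothetical, and the initial gap is assumed to be known or upper-bounded by the caller.

```python
import math


def backups_to_tolerance(phi, initial_gap, tol):
    """Smallest k with phi**k * initial_gap <= tol, assuming 0 < phi < 1 and tol > 0."""
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")
    if tol <= 0.0:
        raise ValueError("tol must be positive")
    if initial_gap <= tol:
        return 0
    # phi**k * gap <= tol  <=>  k >= log(tol / gap) / log(phi)  (log(phi) is negative)
    return math.ceil(math.log(tol / initial_gap) / math.log(phi))
```

Because the guarantee is geometric, halving the tolerance adds only about $\log 2 / \log(1/\phi)$ extra backups.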

Theorem 3. The optimal value function $V_n$ of I-POMDP Lite is a piecewise-linear and convex function represented as a finite set of $\alpha$ vectors:

$$V_n(b) = \max_{\alpha \in V_n} (\alpha \cdot b) . \tag{35}$$

Proof: For $n = 1$, it can be verified that the $\alpha$ vectors in the set $V_1$ are simply the weighted average immediate payoffs:

$$V_1 = \left\{ \alpha_u \;\middle|\; u \in U,\ \forall s \in S,\ \alpha_u(s) = \sum_{v} \Pr(v|s)\,R(s,u,v) \right\} .$$



Inductively, let us assume that $V_n$ is a piecewise-linear and convex function up to $n = k$. We need to show that $V_{k+1}$ is also a piecewise-linear and convex function, represented as a finite set of $\alpha$ vectors.

To simplify the notation, let us denote $R(s,u) = \sum_{v} \Pr(v|s)\,R(s,u,v)$ and $b' = B(b,u,v,o)$. Also, let us index the $\alpha$ vectors of $V_k$ with the set of numberings $I = \{1, 2, \ldots, |V_k|\}$. Subsequently, we have

$$V_{k+1}(b) = \max_{u} \left( \sum_{s} b(s)R(s,u) + \phi \sum_{v,o} \Pr(v,o|b,u) \max_{i \in I} \sum_{s'} b'(s')\,\alpha^{i}(s') \right) . \tag{36}$$

Recall that the belief update step is defined as

$$b'(s') = \frac{1}{\Pr(v,o|b,u)}\, Z(s',u,o) \sum_{s} T(s,u,v,s')\,\Pr(v|s)\,b(s) .$$

Let $f'$ be the unnormalized version of $b'$; we have

$$f'(s') = Z(s',u,o) \sum_{s} T(s,u,v,s')\,\Pr(v|s)\,b(s) . \tag{37}$$

Using equation (37), we can simplify equation (36) as

$$\begin{aligned}
V_{k+1}(b) &= \max_{u} \left( \sum_{s} b(s)R(s,u) + \phi \sum_{v,o} \max_{i \in I} \sum_{s'} \alpha^{i}(s')\,f'(s') \right) \\
&= \max_{u} \left( \sum_{s} b(s)R(s,u) + \phi \max_{x_{1,1} \in I} \max_{x_{1,2} \in I} \cdots \max_{x_{|V|,|O|} \in I} \sum_{v,o,s'} \alpha^{x_{v,o}}(s')\,f'(s') \right) \\
&= \max_{u} \max_{x_{1,1} \in I} \cdots \max_{x_{|V|,|O|} \in I} \left( \sum_{s} b(s)R(s,u) + \phi \sum_{v,o,s'} \alpha^{x_{v,o}}(s')\,f'(s') \right) .
\end{aligned} \tag{38}$$

Substituting equation (37) into (38), we obtain

$$V_{k+1}(b) = \max_{u} \max_{x_{1,1} \in I} \cdots \max_{x_{|V|,|O|} \in I} \sum_{s} b(s) \left( R(s,u) + \phi \sum_{v,o,s'} \alpha^{x_{v,o}}(s')\,Z(s',u,o)\,T(s,u,v,s')\,\Pr(v|s) \right) .$$

The last equation shows that $V_{k+1}(b)$ is a piecewise-linear and convex function. Essentially, it is equivalent to equation (35). $\square$
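The construction in this proof can be illustrated with a small enumeration-based backup. The sketch below is not the authors' implementation: the nested-list layouts `R[s][u][v]`, `T[s][u][v][s2]`, `Z[s2][u][o]` and `pred[s][v]` (for $\Pr(v|s)$), as well as all function names, are assumptions made for illustration, and the enumeration is only practical for tiny models.

```python
import itertools


def initial_alphas(R, pred):
    """V_1: one alpha vector per action u, alpha_u(s) = sum_v Pr(v|s) R(s,u,v)."""
    n_s, n_u, n_v = len(R), len(R[0]), len(R[0][0])
    return [
        [sum(pred[s][v] * R[s][u][v] for v in range(n_v)) for s in range(n_s)]
        for u in range(n_u)
    ]


def exact_backup(alphas, R, T, Z, pred, phi):
    """Build the alpha vectors of V_{k+1} from those of V_k by enumerating
    every tuple (u, x_{1,1}, ..., x_{|V|,|O|}), as in the last equation of the proof."""
    n_s, n_u, n_v = len(R), len(R[0]), len(R[0][0])
    n_o = len(Z[0][0])
    new_alphas = []
    for u in range(n_u):
        # x[v * n_o + o] selects which old alpha vector is used after (v, o).
        for x in itertools.product(range(len(alphas)), repeat=n_v * n_o):
            alpha = []
            for s in range(n_s):
                value = sum(pred[s][v] * R[s][u][v] for v in range(n_v))
                for v in range(n_v):
                    for o in range(n_o):
                        chosen = alphas[x[v * n_o + o]]
                        value += phi * pred[s][v] * sum(
                            chosen[s2] * Z[s2][u][o] * T[s][u][v][s2]
                            for s2 in range(n_s)
                        )
                alpha.append(value)
            new_alphas.append(alpha)
    return new_alphas


def value(alphas, b):
    """Evaluate V(b) = max_alpha (alpha . b), as in equation (35)."""
    return max(sum(a * p for a, p in zip(alpha, b)) for alpha in alphas)
```

For any belief `b`, `value(exact_backup(...), b)` should agree with evaluating the right-hand side of (36) directly, which is one way to check the sketch on a small random model.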

This proof also implies an exponential increase in the number of $\alpha$ vectors after each back-up operation. Intuitively, for each tuple $(u, x_{1,1}, \ldots, x_{|V|,|O|})$, we can compute a new $\alpha$ vector for $V_{k+1}$. Thus, the number of $\alpha$ vectors needed to exactly represent $V_{k+1}$ is equal to the number of those tuples. Since the domain of each variable $x_{i,j}$ is the set of integers $I = \{1, 2, \ldots, |V_k|\}$ and there are $|V||O|$ of those variables, we have $|U||V_k|^{|V||O|}$ such tuples. Consequently, there will be $|V_{k+1}| = |U||V_k|^{|V||O|}$ $\alpha$ vectors generated to represent $V_{k+1}$. This explains why an exact back-up operation with respect to the whole belief simplex is infeasible in practice.
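To get a feel for this growth rate, the short sketch below iterates the count $|V_{k+1}| = |U||V_k|^{|V||O|}$ starting from $|V_1| = |U|$ (one $\alpha$ vector per action, as in the base case of the proof). The function name and the example sizes are hypothetical.

```python
def alpha_vector_counts(n_u, n_v, n_o, backups):
    """Return [|V_1|, |V_2|, ...] under |V_{k+1}| = |U| * |V_k| ** (|V| * |O|)."""
    counts = [n_u]  # |V_1| = |U|
    for _ in range(backups):
        counts.append(n_u * counts[-1] ** (n_v * n_o))
    return counts


# Two actions for each agent and two observations:
print(alpha_vector_counts(2, 2, 2, 2))  # [2, 32, 2097152]
```

Even in this tiny setting, two exact backups already require over two million $\alpha$ vectors.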

Theorem 4. The performance loss $\delta_n = \|J^*_n - J_n\|_\infty$ incurred by executing the I-POMDP Lite policy, induced with respect to the predicted mixed strategy $\pi_{-t}(s,v) = \Pr(v|s)$ of agent $-t$ (as compared to its true mixed strategy $\pi^*_{-t}(s,v) = \Pr^*(v|s)$), after $n$ backup steps is linearly bounded by the prediction error $\epsilon_p$:

$$\|J^*_n - J_n\|_\infty \le 2\epsilon_p |V| R_{\max} \left( \phi^{n-1} + \frac{1}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) \right) .$$
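The right-hand side of Theorem 4 is $2\epsilon_p C$, with $C$ as defined in (40), and is easy to evaluate for concrete problem sizes. The helper below is a minimal sketch with hypothetical names; it simply evaluates that expression and makes the linear dependence on $\epsilon_p$ explicit.

```python
def performance_loss_bound(eps_p, n_v, n_o, r_max, phi, n):
    """Evaluate 2 * eps_p * C, where
    C = phi**(n-1) * |V| * R_max + |V| * R_max / (1 - phi) * (1 + 3 * phi * |O| / (1 - phi))."""
    if not 0.0 <= phi < 1.0:
        raise ValueError("phi must satisfy 0 <= phi < 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    c = (phi ** (n - 1) * n_v * r_max
         + n_v * r_max / (1.0 - phi) * (1.0 + 3.0 * phi * n_o / (1.0 - phi)))
    return 2.0 * eps_p * c
```

Doubling `eps_p` doubles the returned bound, while the $1/(1-\phi)^2$ factor makes the bound loose when $\phi$ is close to 1.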



Proof: Let us define $I_n(b)$ as the expected total reward if our agent follows the optimal policy $\pi$ computed with respect to our predictive distribution $\Pr(v|s)$, and if the other agent's true mixed strategy is exactly $\Pr(v|s)$. We have:

$$I_n(b) = \sum_{s,v} \Pr(v|s)\,R(s,\pi(b),v)\,b(s) + \phi \sum_{v,o} \Pr(v,o|b,\pi(b))\,I_{n-1}(b')$$

with $b' = B(b,\pi(b),v,o)$.

It can be trivially verified that $I_n = V_n$. Now, we have

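$$\begin{aligned}
|J^*_n(b) - J_n(b)| &\le |J^*_n(b) - I_n(b)| + |I_n(b) - J_n(b)| \\
&= |V^*_n(b) - V_n(b)| + |I_n(b) - J_n(b)| \\
&\le \epsilon_p C + |I_n(b) - J_n(b)| && \text{(Proposition 4)}
\end{aligned} \tag{39}$$

where $C$ is the constant

$$C = \phi^{n-1}|V|R_{\max} + \frac{|V|R_{\max}}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) . \tag{40}$$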

For convenience, let us denote from now on $u = \pi(b)$ and $b^* = B^*(b,\pi(b),v,o)$. Thus, we have

$$\begin{aligned}
|I_n(b) - J_n(b)| &\le \sum_{s,v} b(s)\,R(s,u,v)\,|\Pr^*(v|s) - \Pr(v|s)| + \phi \sum_{v,o} \left| \Pr^*(v,o|b,u)\,J_{n-1}(b^*) - \Pr(v,o|b,u)\,I_{n-1}(b') \right| \\
&\le \phi \sum_{v,o} \left| \Pr^*(v,o|b,u)\,J_{n-1}(b^*) - \Pr(v,o|b,u)\,I_{n-1}(b') \right| + R_{\max}|V|\epsilon_p && \text{(Lemma 2)}
\end{aligned} \tag{41}$$

Next, we have

$$\begin{aligned}
\left| \Pr^*(v,o|b,u)\,J_{n-1}(b^*) - \Pr(v,o|b,u)\,I_{n-1}(b') \right| &\le \Pr^*(v,o|b,u)\,|J_{n-1}(b^*) - I_{n-1}(b')| + I_{n-1}(b')\,|\Pr^*(v,o|b,u) - \Pr(v,o|b,u)| \\
&\le \Pr^*(v,o|b,u)\,|J_{n-1}(b^*) - I_{n-1}(b')| + I_{n-1}(b')\,\epsilon_p && \text{(Lemma 3)} \\
&\le \Pr^*(v,o|b,u)\,|J_{n-1}(b^*) - I_{n-1}(b')| + \epsilon_p \frac{R_{\max}}{1-\phi} .
\end{aligned} \tag{42}$$

The last step follows because the expected total reward $I_{n-1}(b')$ is always bounded by $R_{\max}/(1-\phi)$.
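Let us denote $\gamma_n = \max_{b} |I_n(b) - J_n(b)|$. We have

$$\begin{aligned}
\Pr^*(v,o|b,u)\,|J_{n-1}(b^*) - I_{n-1}(b')| &\le \Pr^*(v,o|b,u) \left( |J_{n-1}(b^*) - I_{n-1}(b^*)| + |I_{n-1}(b^*) - I_{n-1}(b')| \right) \\
&= \Pr^*(v,o|b,u) \left( |J_{n-1}(b^*) - I_{n-1}(b^*)| + |V_{n-1}(b^*) - V_{n-1}(b')| \right) \\
&\le \Pr^*(v,o|b,u) \left( |J_{n-1}(b^*) - I_{n-1}(b^*)| + \frac{R_{\max}}{1-\phi}\|b^* - b'\|_1 \right) && \text{(Lemma 4)} \\
&\le \Pr^*(v,o|b,u) \left( \gamma_{n-1} + \frac{R_{\max}}{1-\phi}\|b^* - b'\|_1 \right) && \text{(Def. of } \gamma_n\text{)} \\
&= \Pr^*(v,o|b,u)\,\gamma_{n-1} + \frac{R_{\max}}{1-\phi}\,\Pr^*(v,o|b,u)\,\|b^* - b'\|_1 \\
&\le \Pr^*(v,o|b,u)\,\gamma_{n-1} + \frac{R_{\max}}{1-\phi}\,2\epsilon_p && \text{(Proposition 1)} \\
&= \Pr^*(v,o|b,u)\,\gamma_{n-1} + 2\epsilon_p \frac{R_{\max}}{1-\phi} .
\end{aligned} \tag{43}$$

Substituting (43) into (42), we have

$$\left| \Pr^*(v,o|b,u)\,J_{n-1}(b^*) - \Pr(v,o|b,u)\,I_{n-1}(b') \right| \le \Pr^*(v,o|b,u)\,\gamma_{n-1} + 3\epsilon_p \frac{R_{\max}}{1-\phi} . \tag{44}$$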

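Substituting inequality (44) into (41), we have

$$\begin{aligned}
|I_n(b) - J_n(b)| &\le R_{\max}|V|\epsilon_p + \phi \sum_{v,o} \left( \Pr^*(v,o|b,u)\,\gamma_{n-1} + 3\epsilon_p \frac{R_{\max}}{1-\phi} \right) \\
&\le R_{\max}|V|\epsilon_p + \phi \sum_{v,o} \Pr^*(v,o|b,u)\,\gamma_{n-1} + 3\phi|V||O| \frac{R_{\max}}{1-\phi}\,\epsilon_p \\
&\le \phi\,\gamma_{n-1} + R_{\max}|V|\epsilon_p + 3\phi|V||O| \frac{R_{\max}}{1-\phi}\,\epsilon_p .
\end{aligned} \tag{45}$$

Since inequality (45) holds for all $b$, we have the recurrence

$$\begin{aligned}
\gamma_n &\le \phi\,\gamma_{n-1} + \epsilon_p |V| R_{\max} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) \\
&\le \phi^{n-1}\gamma_1 + \frac{\epsilon_p |V| R_{\max}}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) && \text{(expanding the recurrence)} \\
&\le \phi^{n-1} R_{\max}|V|\epsilon_p + \frac{\epsilon_p |V| R_{\max}}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) \\
&= \epsilon_p \left( \phi^{n-1}|V|R_{\max} + \frac{|V|R_{\max}}{1-\phi} \left( 1 + \frac{3\phi|O|}{1-\phi} \right) \right) \\
&= \epsilon_p C .
\end{aligned} \tag{46}$$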

From (46), it follows that $|I_n(b) - J_n(b)| \le \gamma_n \le \epsilon_p C$. Hence, plugging it into inequality (39), we have

$$|J^*_n(b) - J_n(b)| \le \epsilon_p C + |I_n(b) - J_n(b)| \le 2\epsilon_p C . \tag{47}$$

Since inequality (47) holds for all $b$, we have:

$$\|J^*_n - J_n\|_\infty = \max_{b} |J^*_n(b) - J_n(b)| \le 2\epsilon_p C . \tag{48}$$

Finally, we complete the proof by substituting equation (40) into inequality (48):






